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Editor's 
Introduction 



This issue of the Digital Technical 
Journal presents papers on a range 
of computing subjects, beginning 
with recent advances in storage tech- 
nologies, followed by network router 
cluster enhancements, new desktop 
software for sharing 3-D applications 
across platforms, and an experimental 
High Performance Fortran debugger. 

DIGITAL'S storage engineers have 
been leaders in the definition of the 
parallel small computer system inter- 
face (SCSI) ANSI standards and in 
related technology improvements. 
Bill Ham's paper focuses on four 
advances in the physical features of 
SCSI that resulted in major increases 
in SCSI capabilities and minor distur- 
bances when incorporated in existing 
installations. The discussion spans 
developments from SCSI-2 through 
UltraSCSI, including speed increases 
in the synchronous data phase; longer, 
more complex configurations enabled 
by bus expanders; physical versatility 
inherent in a decreased size of die 
interconnect; and dynamic removal 
and replacement of devices on an 
active bus (hot plugging). 

The su bject of our next paper is 
networks, and the emphasis of the 
engineering is on customer require- 
ments for reliability and availability. 
Router clusters, described here by 
Peter Higginson and Mike Shand, 
were developed to provide fast fail- 
over response in IP networks and are 
defined as a group of routers on the 
same local area network (LAN) pro- 
viding mutual backup. New router 
cluster protocols and mechanisms 
restrict the loss of service that results 
from a failure on the network, speci- 



fically on networks requiring high 
availability, such as telecommunica- 
tion and stock exchange networks. 
The authors analyze failure cases and 
present die solutions that reduced 
service-loss time from approximately 
30 to 45 seconds to 5 seconds in both 
LAN and WAN environments. 

Collaboration software for desk- 
top systems can be broadly defined 
to encompass a range of capabilities, 
from a simple transfer of data between 
users, such as e-mail sent over a net- 
work, to real-time sharing of text, 
graphics, and audio and video data. 
Larry and Ricky Palmer have designed 
a software product, called Shared 
Desktop, for users who want to share 
three-dimensional graphics applica- 
tions and audio across networks. 
Notably, the design differentiates 
itself by supporting multiple operat- 
ing systems, currently enabling real- 
time interopcration among Windows 
and UNIX systems. The authors dis- 
cuss the decision to create a "view- 
port," which is a part of the desktop 
screen, and issues they addressed dur- 
ing implementation, including proto- 
col splitting, screen capture and data 
handling, and dissimilar frame buffers. 
They conclude with ideas for possible 
enhancement of the product in the 
future. 

In a previous issue of the Journal 
featuring technical computing topics 
(vol. 7 no. 3), Jonathan Harris et al. 
described DIGITAL'S Fortran 90 
compiler that implements High 
Performance Fortran version 1.1, 
a language for writing parallel pro- 
grams. An outgrowth of that work 
is an experimental debugger, code- 



named Aardvark, that "reconstructs" 
for the HPF programmer a single 
source-level view, even though the 
program has several flows of control 
and the data are distributed. David 
LaFrance- Linden discusses the chal- 
lenges faced in creating the debugger 
and describes useful techniques and 
concepts, such as logical entities, that 
can be generally applied to debugger 
design. 

Readers interested in past issues 
of the Journal are invited to visit the 
Journal Web site at http://www. 
digital.com/info/dtj/. Our next 
issue will address such topics as opti- 
mization of NT executables on Alpha, 
a new graphics program, and VLM. 
A Special Issue on programming lan- 
guages and tools is being developed 
for publication in the fall of 1998. 




Jane C. Blake 
Managing Editor 
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Richard Lary 

DIGITAL Storage Technical Director 



Welcome to the winter 1997-98 issue 
of the Digital Technical Journal. This 
issue does not have a single theme; 
it contains a potpourri of papers on 
a wide range of technical topics. This 
provides the foreword writer with a 
small gift and a not-so-smalJ headache. 

The gift is the opportunity to tout 
the continuing fecundity of DIGITAL'S 
engineering community. All the 
papers in this issue of the Journal 
come from product development 
groupsin DIGITAL, and aJJ the tech- 
nology described herein is directly 
applicable to the problems of using 
computers in the real world. The 
papers themselves cover a wide range 
of topics: designing storage buses 
and their infrastructure; building IP 
routers that reduce network delays 
caused by Link or router failure; sharing 
3-D graphical and audio data across 
networks of computers with different 
windowing systems; and debugging 
programs written in languages that 
incorporate data parallelism. 

The headache, of course, stems 
from this very diversity. Any attempt 
to derive some set of common under- 
lying principles other than "make 
better stuff" from this collection is 
doomed to sophistry. And my techni- 
cal background is too narrow to pro- 
vide any significant embellishment to 
any of the papers outside the domain 
of storage systems. So, with apologies 
to the other authors, I am forced to 
restrict my comments to what I know 
— the background and impact of Bill 
Ham's work on advances in parallel 
SCSI which are presented in his paper 
in this Journal. 

BiJJ Ham's paper not onJy describes 
a significant technical achievement; it 
illustrates DIGITAL'S shift from engi- 



neering proprietary storage systems 
to engineering open storage systems. 

The SCSI bus was developed dur- 
ing the early 1980s as one of many 
attempts to standardize the interface 
to storage devices. It succeeded beyond 
the expectations of its developers, 
largely because it supported a device 
model that was abstract enough to be 
extensible but inexpensive enough to 
be implemented in the technology of 
the time. For all its advantages, how- 
ever, SCSI suffered from poor engi- 
neering at the physical level. This was 
a direct result of the way it was devel- 
oped. The diverse corporate repre- 
sentatives that defined SCSI did not 
have the time or money to specify and 
buiJd custom bus infrastructure com- 
ponents (transceivers, cables, termi- 
nators, etc.), so they used commonly 
available parts. A lack of sophistica- 
tion in specifying physical interface 
parameters resulted in a specification 
that allowed too much component 
variation. As a result, it was difficult 
to build reliable, multi-box systems 
using SCSI. 

DIGITAL'S attitude towards SCSI 
during this period was to ignore it 
and hope it would go away. We had 
designed our own proprietary Digital 
Storage Architecture (DSA), which 
utilized an abstract and extensible 
device model and also incorporated 
many large system features, including 
a robust physical interconnect. We 
controlled the design and manufac- 
ture of all DSA components and 
could thus guarantee that they all 
met tight architectural specifications. 
Moreover, DSA was a key enabling 
technology for VMS Clusters, the 
individual DSA components were 
competitive with their counterparts 
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from rhc proprietary storage architec- 
tures of other large systems compa- 
nies, our customers were happy, and 
the storage business was profitable. 
We were feeling quite pleased with 
ourselves — and we were profoundly 
ignorant of the power of a successful 
open market standard, since one had 
never existed in the storage world. 

During the latter half of the 1980s, 
SCSI grew steadily in popularity until 
it dominated the workstation and 
small-server markets. These systems 
had at most a few disk drives on 
them, and SCSI's signal integrity 
problems were manageable in that 
context. They were not manageable 
in the larger and more demanding 
data center systems, and so SCSI was 
not used there. The SCSI standards 
group was aware of the bus's deficien- 
cies, however, and as the decade pro- 
gressed, the group made amendments 
to the standard to eliminate many of 
them. By the turn of the decade, sex - 
oral independent subsystem vendors 
were selling subsystems utilizing SCSI 
devices as storage for large DIGITAL 
systems. These subsystems did not, 
in general, have the features, perfor- 
mance, or robustness of our subsys- 
tems, but they were significantly 
cheaper and improving all the time. 
By 1991 , it had become obvious to 
us that we would not be able to com- 
pete with these systems in the long 
run. They were leveraging an entire 
industry's investment and talent and 
were reaping the cost benefits of 
high-volume manufacturing; whereas 
we had to design and manufacture (at 
relatively low volume) every component 
of every DSA system ourselves. 



Our position was untenable. We had 
to change our strategy and embrace the 
bus that we had so studiously ignored. 

We designed a modular packaging 
architecture for SCSI devices (known 
commercially as StorageWorks) and 
a set of storage array controllers that 
interfaced these devices to our sys- 
tems (and systems from other major 
vendors as well). We also became 
active participants in the SCSI stan- 
dards process. Where DIGITAL had 
prev iously sent one or two engineers 
to SCSI standards meetings strictly to 
gather information, we started to send 
up to half-a-dozen engineers to listen, 
learn, participate in debate, help 
with the grunt work of the standards 
process, and make proposals to amend 
or extend the standard in directions 
useful to us and our customers. 

Our new modular packaging 
design allowed our customers to 
install and remove storage devices 
themselves and to migrate storage 
devices between systems, even 
between systems built by different 
svstem vendors. This modularity 
proved to be a very valuable feature 
to our customers. Howev er, it 
required us to build a physical infra- 
structure for the SCSI bus that had 
the robustness needed by our large 
systems and that could accommodate 
a great deal of variability in configura- 
tion, and to use a bus that was known 
to have residual signal integrity prob- 
lems in its physical interconnect. We 
were understandably worried about 
this, worried enough to charter a 
small group of engineers as a SCSI 
Bus Technical Office (SBTO) within 
the storage group, and to develop 



short-term configuration guidelines 
for our packaging architecture and 
long-term technical proposals for the 
SCSI physical bus architecture. Bill 
Ham hns been the head of SBTO 
since its inception and has also been 
our representative to the SCSI com- 
mittee on all matters relating to the 
physical bus interconnect. 

In the summer of 1993, Bill com- 
pleted a study of the signal integrity 
issues surrounding parallel SCSI. His 
conclusions were startling. The SCSI 
standards committee had, over the 
years, made enough improvements in 
the basic transmission line characteris- 
tics of the SCSI bus that most of the 
remaining signal integrity problems 
were due to the variations in compo- 
nent parameters allowed bv the SCSI 
specification. Exercising tighter con- 
trol over component variation — 
through building selected compo- 
nents or through purchase specifica- 
tions with our suppliers — would not 
only produce excellent signal integrity 
in our packaging but would allow the 
maximum clock rate of the bus to be 
doubled while maintaining excellent 
signal integrity and backwards com- 
patibility with existing SCSI devices. 
Bill's results also indicated that the 
maximum clock rate could be increased 
even further, with more work. 

This discovery came at a critical 
time in the evolution of the SCSI 
standard. Much of the SCSI standard 
committee's effort in the early part of 
the 1990s was being spent in modify- 
ing the SCSI standard so that serial 
buses could carry the higher level 
SCSI bus protocols. The committee 
had started this work under the 
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assumption that parallel SCSI was 
"our of gas" in performance, and 
the new serial bus variants would 
supplant it by mid-decade. However, 
by 1993 not only was the definition 
and implementation of the serial bus 
going slower than expected, but there 
were three independent and incom- 
patible serial bus proposals, each with 
unique useful features and unique 
drawbacks, each with a cadre of sup- 
porters among the industry represen- 
tatives. The market would ultimately 
choose which serial buses would 
thrive; but it was highly unlikely that 
all three would thrive. Storage ven- 
dors that made the wrong bus choice 
would suffer for it. Most galling to 
the technophiles among us, the mar- 
ket's choice could not be predicted 
from the technical merits of the con- 
tenders. If it could, we'd all have 
Betamax YCRs in our homes today. 

So, DIGITAL decided to have Bill 
present his results to the SCSI com- 
mittee at its November 1993 meeting 
and recommend that the committee 
extend the SCSI specification to allow 
the bus to run at up to twice its old 



maximum clock rate if the components 
in the physical interconnect met the 
tighter specifications. Our motive in 
doing this was purely selfish: we were 
not ready to choose among die serial 
bus proposals, yet we would soon 
need more performance than parallel 
SCSI could offer. A higher perfor- 
mance parallel SCSI would aJlow us 
to improve our storage subsystem 
performance without having to stake 
our fortunes on a potential Betamax. 

Bill's presentation at the SCSI 
committee meeting was met with 
enthusiastic approval. It turned out — 
surprise! — that other system vendors 
were feeling as uneasy as we were 
about the serial SCSI buses. The pro- 
posal, christened UltraSCSl, was 
adopted as an extension to the parallel 
SCSI standard. Bill Ham and the 
SBTO then worked with component 
vendors and the SCSI committee to 
develop the thinner cables, smaller 
connectors, and SCSI expander cir- 
cuits described in his paper, all with 
the aim of keeping parallel SCSI as a 
desirable alternative to the serial SCSI 
buses. Today, four years after its com- 



mittee debut, UltraSCSl is solidly 
entrenched in the storage market. In 
fact, storage market analysts are now 
projecting that the combined volume 
of devices on all serial SCSI buses 
(yes, there are still three, but the 
market has already picked one, Fibre 
Channel, as the winner) will not 
exceed parallel SCSI device volumes 
until early in the next century. And 
the SCSI committee has finished 
extending the parallel SCSI specifi- 
cation to achieve a second doubling 
of maximum bus clock and is in the 
midst of defining a third doubling. 

Without hyperbole it can be said 
that the technology embodied in Bill 
Ham's paper has directly affected the 
course of the computer storage indus- 
try, and it continues to affect posi- 
tively DIGITAL'S position in that 
industry. Enjoy reading the paper 
and those that follow it in this issue. 
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WilJiam E. Ham 



Recent Advances in Basic 
Physical Technology for 
Parallel SCSI: UltraSCSI, 
Expanders, Interconnect, 
and Hot Plugging 

DIGITAL uses SCSI technology in most of its 
storage products and consequently has led 
major standards and industry bodies to improve 
the technology in the following areas: increased 
synchronous data phase speed beyond fast SCSI; 
longer, more complex electrical configurations 
by means of expander circuits; versatile and 
more manageable connectivity through a 
smaller, improved physical interconnect; and 
dynamic device insertion and removal. Data 
phase transmission rate extension is achieved 
through understanding and controlling silicon 
chip timing and transmission media parameters. 
Using expander devices to confine transmission 
line effects to shorter segments allows large 
increases in the maximum distance between 
devices and in the device population within the 
same SCSI domain. Expanders enable complex, 
hublike configurations to be created without 
changing existing SCSI devices or software. 
The use of 0.8-millimeter connector technology 
and consideration of cable losses has reduced 
the physical size of the external shielded inter- 
connect by approximately two thirds, decreased 
the number of parts required to support com- 
plex configurations by a factor of 10, and 
increased the interconnect density to the same 
level used in serial SCSI. Finally, the mating and 
demating events that occur during device inser- 
tion and removal produce a spectrum of small, 
undetectable, electrical disturbances on the 
active busthatappearto be limited by the 
physics of the media and device capacitance. 



Introduction 

Parallel Small Computer System Interface (SCSI) is the 
workhorse technology for most of the storage applica- 
tions in DIGITAL products today. This device and 
interconnect technology spans all system offerings 
from the s i m plest t o t h e most complex . S CS I was intro- 
duced to the higher-end products in the early 1990s as 
the open systems follow-on to the DIGITAL propri- 
etary Digital Storage System Interconnect (DSSI) and 
Computer Interconnect (CI) technologies. 

As system demands have increased, SCSI has evolved 
to meet the needs. DIGITAL has made considerable 
contributions to the technology and led the effort to 
achieve industry standardization. This paper details the 
most significant developments in die physical features 
of parallel SCSI technology over the last several years 
that have allowed it to continue to serve DIGITALcus- 
tomers in an effective, competitive way. The discussion 
targets the following four important areas: 

1. Speed increases in the synchronous data phase, 
which resulted in the ANSI definition of UltraSCSI 
(Fast-20 SCSI) technology 1 

2. Development of software-invisible circuits, gener- 
ally called expanders, that enable segmentation of 
SCSI domains i nto easily managed pieces 

3. New connector and cable technology, namely the 
Very High Density Cabled Interconnect (VHDCI ) 
device, that decreases the interconnect size and 
complexity by many fold 2 

4. Dynamic removal and replacement of devices on an 
active bus, which is referred to as hot plugging 

DIGITAL made substantial contributions in the 
four areas. This work included creating the expander 
and interconnect standards projects; leading the work- 
ing groups that defined the Fast-20, expander, and 
interconnect standards; providing data for the Fast-20 
and hot-plugging projects; and proposing and gaining 
approval for the hot-plugging standard. 

The author has taken a phenomenological approach 
throughout, because in most cases there are too many 
unknowns to achieve a rigorous analytical result. This 
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paper focuses on developments from SCSI-2 through 
UltraSCSI and specifically does not address the new 
Low Voltage Differential (LVD) technology being 
introduced for the highest-speed applications. 

Pedigree 

SCSI is defined in several ANSI standards' u and in the 
material that was developed to create these standards 56 
The standards were generated over the last decade 
through a cooperative effort of approximately 60 major 
companies in the computer and computer support 
industry. As a result of this pedigree, die prime directive 
for SCSI technology is interoperability of devices 
designed and manufactured by different companies. 

The details of the physical designs used to implement 
SCSI may not be visible to users and researchers; these 
details contain much of the marketing and technical 
differentiation between the products of the participat- 
ing companies and are therefore hidden in the silicon 
design. The behavior at the device connector pervades 
the SCSI specifications. The basic assumption is that as 
long as the properties are compatible at these connec- 
tors, device substitution is possible. Thus, SCSI devices 
may be both interoperable and of different designs. 

Basic Architecture 

This section reviews the basic architecture of parallel 
SCSI. The SCSI bus is a parallel, multidrop, vvired-OR 
configuration. 

Signal Multiplexing and Phases The parallel signal 
construction of the bus allows multiplexing of some 
signals during different phases of communication so 
that the same signal lines may have very different func- 
tions in different phases. The physical behavior of sig- 
nals is usually limited by the phase during which the 
shortest pulses are used and the demands for signal 
integrity are the highest. The limiting SCSI phase is 
the data phase (payload phase) that is executed with 
the highest synchronous rate. For UltraSCSI, this peak 



repetition rate is 20 megahertz (MHz). Table 1 con- 
tains the generally accepted terminology related to 
data phase speeds. 

Because of the wired-OR property, each signal in 
the bus must be driven to a known state even if no 
SCSI device is actually driving the signal. SCSI uses the 
logical 0 state (negated state) as the undriven state and 
uses die bus terminators to drive the signal to this state 
in the absence of any driving devices. The device signal 
drivers must overcome this terminator-driven logic 
state of 0 in order to send a logical 1 (asserted state) 
onto the signal line. 

SCSI signals must support all frequencies, from stat- 
ically driven by the terminators only (DC) to the third 
harmonic of the fastest signal edge in the synchronous 
data phase. In many cases, the same wire must support 
ail these frequencies at different times during the SCSI 
protocol. 

The highest signal edge slew rates for UltraSCSI 
are approximately 500 millivolts per nanosecond 
(mV/ns). A 2-volt (V) transition requires approxi- 
mately 4 ns/5.4 ns/meter (m) = 0.74 m for a signal 
edge (assuming 5.4ns/m as the propagation velocity 
of the signal edge). Therefore, some relief exists 
because the connectors and cable assembly termina- 
tions are much smaller than the signal edge length; the 
connectors and terminations do not need to have care- 
fully controlled characteristic impedance properties. 
This allows the use of the technology available in the 
connector and cable assembly industry to optimize 
the interconnect properties without the considerable 
design, manufacturing, and test burden imposed by 
controlled impedance requirements. 

Transmission Modes The transmission mode of a 
SCSI bus is determined by the properties of die 
terminators that, by definition, constitute the ends of 
the bus. Terminators also supply most of the energy 
required to operate the single-ended transmission -mode 
devices and additionally provide the required matching 



Table 1 



Terminology for Data Phase Speeds 




Maximum Transfer 


Maximum Byte 


Maximum Byte 




Rate(Million 


Rate (Narrow) 


Rate (Wide) 


Data Phase Speed Name 


transfers/second) 1 


(Megabytes/second) 


(Megabytes/second) 


Asynchronous 
Slow (synchronous) 
Fast (synchronous) 
Ultra (synchronous) 2 
Ultra2 (synchronous) 3 
Ultra3 (synchronous) 4 


Unspecified 
5 

10 
20 
40 

80 to 100 


Typically ~ 3 

5 

10 
20 
40 

80 to 100 


Typically ~ 6 
10 
20 
40 
80 

160 to 200 



'One transfer is 1 byte in narrow mode and 2 bytes in wide mode; 1 byte equals 8 data bits plus 1 parity bit. 
2 Ultra is synonymous with Ultra 1 and Fast-20. 
3 Ultra2 is synonymous with Fast-40. 

'Rates not yet finalized; Ultra3 is synonymous with Fast-80 or Fast-100. 
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to the characteristic impedance of the transmission line. 
In differential SCSI, the terminators provide a small 
portion of the overall energy required to operate the 
bus; the differential drivers supply the remainder of 
the energy. 

Drivers that want to transmit an asserted state 
must overcome the biasing prov ided by the termina- 
tors. The drivers operate locally on the bus mid alter 
the state in their immediate vicinity when thev switch 
on and off. For single-ended SCSI, the 0 state is 
approximately 2.5 V and the 1 state is approximately 
0.5 V. For high-voltage differential SCSI, the 0 state 
is approximately -1 V to -2 V, and the 1 state is 
approximately 2 V. (The difference between a state 
I and a state 0 is higher with differential — typically, 
approximately 4 V.) 

For single-ended transmissions, the driv ers operate 
on energy previously stored in the bus by the termina- 
tors. This energy is mostly electrostatic energy in the 
charge stored in the capacitance of the transmission line 
for negated states and electromagnetic energy in the 
current flowing through the inductance of the trans- 
mission line for asserted states. Ultimately, the termina- 
tors will set the state back to negated after the drivers 
cease to source or sink current; however, this only hap- 
pens after the round-trip propagation delay from the 
driver to the farthest terminator if the bus does not 
have matched characteristic impedance properties. 

Approximately the same energy transformations 
occur for differential SCSI, but significant current is 
supplied by the drivers for both the asserted and the 
negated states. 

Multidrop Requirements The multidrop architecture 
requires a continuous low-resistance path called the 
bus path between the terminators and allows devices 
to be attached to this path. The number and proper- 
ties of these attached devices vary widely because of 
many factors including the speed of operation, the 
overall length of the bus, and the transmission mode. 
Attached devices always disturb the transmission line 
properties of the bus path; the key to successful opera- 
tion is in the management of the magnitude of these 
disturbances. 

Generally, the more capacitance or electrical length 
the device has, the more disruptive it is. Placing devices 
too close together along the bus path can cause them 
to appear electrically as a single super disruptive device. 
Placing them too far apart can result in an ov erall bus 
length that is too long. 

Wired-OR Glitches During the arbitration phase, 
when the SCSI devices decide which devices will be 
sending payload data to or from each other, multiple 
devices may assert the same control line (BSY) at the 
same time. Each device that wishes to communicate 
asserts both the BSY line and its respective device 
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identification (ID) line. After examining the asserted 
ID lines to determine which device has the highest ID, 
all but the device with the highest ID release the BSY 
line. This leaves only one device, the winner, asserting 
the BSY line. While the current in the BSY line is read- 
justing itself from a multiple-driver asserted condition 
to a single-driver asserted condition, noise pulses (called 
w ired-OR glitches) propagate throughout the length of 
the signal line and may be detected collectively as an 
erroneous phase. Therefore, one of the architectural 
limits for parallel SCSI is the time required for these 
wired-OR glitches to settle. This bus settle time is set by 
protocol at 400 ns and must be interpreted as a round- 
trip propagation time when using a simple SCSI bus. 
Allowing some time for propagation through driver and 
receiver chips yields a maximum physical length for a 
simple bus of 25 meters. 

Areas of Improvement 

Thus, the opportunities for improving SCSI derive 
from appropriately managing the transmission lines, 
taking advantage of the multidrop architecture offered 
bv a parallel wircd-#R structure, using state-of-the-art 
technology from the interconnect and silicon industry, 
and making innovative use of the time required fsr the 
wired-OR glitches to settle. These techniques are the 
basis of the development by DIGITAL in the fsur areas 
addressed in this paper. 

Speed increases in the synchronous data phase are 
based primarily on increasing the timing precision 
in the silicon transceivers by using newer silicon tech- 
nology. The interconnect properties remain largely 
uncharged from those used for fast SCSI. 

Circuits that enable segmentation of SCSI domains 
into easily managed pieces are based on systematic 
isolation of transmission line properties and use of 
wired -OR noise pulse properties. No software, inter- 
connect, or device changes needed to use these circuits. 

New connector and cable technology is based on 
an innovative 0.8-millimeter (mm) ribbon-style con- 
nector technology that optimizes the total SCSI elec- 
trical requirements with the capabilities of cable and 
connector design. 

Dynamic removal and replacement of devices on an 
active bus, i.e., hot plugging, is based on the multidrop 
architecture, which enables devices to be added or 
replaced without affecting continuity between other 
devices. Hot plugging depends on understanding and 
managing the electrical disturbances created during 
the insertion or removal. 

The remainder of this paper provides details of these 
tour areas of improvement. The end result of these 
extensions to the basic physical architecture of parallel 
SCSI is a major increase in its capabilities, accompa- 
nied by only a verv minor disturbance to the installed 
base, especially rhe software. 



Increasing the Synchronous Data Phase Speed 

Beginning with the SCSI-2 standard, the synchronous 
transmission mode is available for transferring payload 
data between SCSI devices. The devices select this 
mode by mutual agreement before any synchronous 
data is passed. The agreement is achieved by using the 
asynchronous transmission mode, which is slow but 
usually reliable. 

The synchronous data phase uses the DATA and 
PARITY bit lines for the data and either the REQ or 
the ACK control line as a signal that the receiver uses 
for capturing the data. The term synchronous derives 
from a specified timing relationship between the bit 
line signal edges and the REQ or ACK signal edges. 
(The falling edge of the ACK signal is used when the 
data phase transmission originates from the SCSI ini- 
tiator, and the falling edge of the REQ signal is used 
when the transmission originates from the target.) 
There is no synchronous relationship between the 
internal timing references on different SCSI devices, so 
the receiver must buffer the received data before intro- 
ducing the data into its internal data management 
structure. This buffering is usually accomplished by 
means of a first in first out (FIFO) circuit that uses the 
REQ or the ACK signal as the latching signal for the 
incoming data. For convenience, in this paper we only 
refer to the ACK signal, with the understanding that 
the same discussion applies to the REQ signal when it 
is used as the data-latching signal. 

Since only the falling edge of the ACK signal is used 
in the presently specified SCSI versions and an ACK sig- 
nal is required for every data transfer, it follows that the 
ACK signal cycles at least twice as fast as the data bits. 
When a continuous stream of transfers is transmitted, 
the ACK signal is a regularly repeating signal, nomi- 
nally, a square wave. An alternating I/O pattern pro- 
duces the highest fundamental frequency for the data 
bits at half the frequency of the ACK signal. Therefore, 
the ACK signal requires careful attention since it is the 
most demanding on the transmission process. 

The focus of this section is to examine how the 
speed of the synchronous data phase was increased by 
a factor of two to achieve the Fast-20 (UltraSCSI) 
specification. 

Status before UltraSCSI 

In 1993, the SCSI-2 standard 3 had been in place 
for two years, and a follow-on standard called SCSI -3 
Parallel Interface (SPI) 4 was technically stable. SPI had 
been created largely because the specifications in the 
SCSI-2 standard were not effective in implementing 
the single-ended version of the synchronous transmis- 
sion (10 megatransfers per second). The differential 
version specified in SCSI-2 worked well but was much 
more expensive in cost, power, and space than the 



single-ended version. Therefore, most of the interest 
was in making the fast single-ended version work 
adequately. 

Taking single-ended SCSI from asynchronous and 
slow synchronous (5 megatransfers per second) to the 
fast synchronous technology was difficult. The prevail- 
ing opinion was that the SPI standard represented 
the final improvement to parallel SCSI. This view 
set the stage for a number of alternate physical techno- 
logies based on the serial point-to-point transmission 
schemes used in communications technologies, e.g., 
Fiber Distributed Data Interface (FDDI) and Ethernet, 
to be used for higher-performance storage applications. 

DIGITAL'S Storage Bus Technical Office had seen 
many instances of difficult implementations that were 
the result of less-than-optimal understanding and 
management of the specification margins. No credible 
study had been presented on the margins available in 
SCSI, so the thrust was to create baseline characteris- 
tics of multidrop parallel SCSI to determine where 
unused margin might exist. 

Little data was available on the precise reasons why 
specific implementations of fast synchronous SCSI did 
not work. The system would hang or report various 
error messages with almost no indication of the basic 
causes. A method that could report margin to failure 
and mechanism of failure was needed to unravel this 
situation. Therefore, the approach DIGITAL took was 
to step back from Rill SCSI implementations and to 
examine the pieces without the encumbrance of the 
SCSI protocol. 

One of the most mysterious areas was the behavior 
of SCSI receivers. The SCSI-2 and SPI specifications 
used bipolar transistor-transistor logic (TTL) levels as 
the basic receiver input levels. Almost all SCSI devices 
were being designed with complementary metal -oxide 
semiconductor (CMOS) technology, so the differ- 
ences between the receiver properties presented a key 
opporamity for hidden margin. Other unknown areas 
were jitter, cross talk, skew, ground offset, effects of 
stubs, and worst-case configurations. 

DIGITAL built a special test environment to sys- 
tematically examine each piece of parallel SCSI. The 
environment was named the PBDIT, an acronym for 
parallel bus data integrity tester. This test environment 
made it possible to systematically examine the real 
margins to failure for the key pieces and to develop the 
confidence that SCSI could be used at elevated speeds 
and be made highly robust at the slower speeds. 

Special Test Environment 

The test environment was built to allow known data 
patterns to be transmitted across a SCSI device, into 
SCSI transmission media, and then into another SCSI 
device. The same data pattern is loaded into both sides 
so the receiver knows exactly what data it is supposed 
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to receive. The transmitting side is called the exciter, 
and the receiving side is called the comparator. 
Received data is committed to the comparator by 
using one bit line as the latching ACK signal in a man- 
ner exactly like that specified in synchronous SCSI 
transmissions. The test environment allows the posi- 
tion of the ACK signal to be adjusted with respect to 
the data signal edges. 

Since the comparator knows the data pattern that is 
transmitted, it is possible to isolate the precise data bit 
that caused the transmission error. This kind of error- 
directed methodology has found widespread use in 
the integrated circuit industry. 

Other features of this test environment include 
detachable load boards that contain the SCSI drivers, 
terminators, receivers, connectors, or any other physi- 
cal media-dependent components. The minimum 
requirements for a load board are that the exciter con- 
tain the SCSI driver and a connector and that the com- 
parator contain the SCSI receiver and a connector. 
Other components may be placed between the load 
boards for different test conditions. The SCSI driver 
must have accessible points for the exciter logic, and 
similarly, the SCSI receiver must have output points to 
drive the comparator. These requirements eliminate 
drivers and receivers that are imbedded within chips 
with other functions. Fortunately, separate SCSI dri- 
vers are available for both single-ended and differential 
versions. (The differential versions normally use sepa- 
rate chips, but only a few choices are presently avail- 
able for the separate single-ended versions.) 

The test environment is useful for developing the 
understanding of operating mechanisms and for mea- 
suring the margins for specific hardware configurations. 
This environment is not useful for deriving specifica- 
tions, since the performance at the specified interfaces, 
i.e., the device connectors, is not directly observable. 



Oscilloscope measurements provide the basis for setting 
compliance specifications, since these measurements 
can be performed at the connectors. The basic question 
that needed an answer was, Can parallel SCSI be oper- 
ated at elevated speeds with reasonable margin to fail- 
ure? DIGITAL optimized the special test environment 
to answer this question. Other specifications that would 
be necessary to ensure interoperable operation between 
UltraSCSI devices could be denved if it appeared possi- 
ble to achieve the end result. 

The data pattern loading and digital control of the 
exciter and the comparator were achieved through opti- 
cally coupled means. This allowed the ground offset volt- 
age to be adjusted between the driver and the receiver 
without compromising the operation of the logic. 

The data Hows only from the exciter to the 
comparator. If bidirectional information is desired, 
the physical connections between the exciter and 
comparator have to be reversed. This scheme leaves 
untested the cross- talk effects on the REQ signal that 
is traveling in the opposite direction to the ACK signal 
(if ACK is synchronized with the data as in a write 
operation). Separate measurements are necessary to 
examine this issue. Cross talk into other control lines is 
addressed by holding these lines constant in the data 
pattern transmitted . 

The SCSI standard deals with the REQ cross-talk 
issue by requiring that the data lines be physically sep- 
arated from the REQ and ACK lines in the transmis- 
sion media. Measurements not reported in this paper 
have confirmed negligible speed-related cross talk into 
the RKQ line. 

Up to 27 pairs of 3-byte-wide lines (wide SCSI uses 
only 18 pairs for high-speed transmissions) can be 
tested with the special test environment. Figure 1 is a 
functional diagram of the test environment. The SCSI 
terminators are shown as separate from the load 
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boards in this case. A key feature of this kind of testing 
is that the test does not necessarily stop when an error 
is detected. In fact, the environment may detect errors 
100 percent of the time. This acceptable behavior 
allows mapping of the complete bit-error response of 
the system. 

Sample Data from the Special Test Environment The 

test environment allows a multitude of tests to be per- 
formed. The test scheme described in this section is 
the one that was used to establish the basic timing 
margins available from normal SCSI silicon, cables, 
connectors, and terminators. 

A random repeating data pattern with 1 6 thousand 
different bit combinations was used as the basic data 
pattern. This pattern was transmitted over a period of 
time, and the number of errors detected was recorded. 
In this test, an error is defined as one or more bits in the 
received data transfer that do not match die transmitted 
bit. To acquire a new error rate data point, the transmis- 
sion test is repeated by using exactly the same number of 
transfers in the same time period with the same data pat- 
tern but with some test parameter changed. 

Virtually any parameter can be varied for different 
tests. For a given physical configuration, the most use- 
ful parameter for determining the timing margin is the 
position of the ACK pulse with respect to the data 
edges. The basic data then becomes the number of 
errors detected and the position of the ACK pulse edge. 

There are two basic random variables operating in 
this scheme: the data pattern and the jitter induced by 
non-data-dependent sources. It is easy to separate these 
two variables by using extremes in the data pattern: 
very few transitions and the maximum number of tran- 
sitions (every data edge has a transition, i.e., alternating 
I /0 pattern). Although this level of precision is avail- 
able, we will see that we really do not need to bother 
for parallel SCSI at the maximum UltraSCSI rate. 



Figure 2 shows a typical error rate plot from a sim- 
ple single-ended configuration made from ordinary 
SCSI interconnect hardware and transceivers being 
tested at the maximum UltraSCSI rate. Each data 
point represents a 3-second sample (60 million trans- 
fers) at each ACK position. The ACK position is incre- 
mented in 0.1 -ns steps for a total of 240 independent 
tests in the plot. To minimize the testing time, we 
tested only the time ranges from -3 to 9 ns and 44 to 
56 ns. The individual data points are not distinguish- 
able in this presentation, and there is very little scatter 
between neighboring points. In Figure 2, the error rate 
of 1 is used to indicate that no errors were detected, 
since the log of 0 is not easy to plot. 

Examination of the raw data reveals that the plot is 
monotonic in detected error rate to the fourth decimal 
place. This indicates an extremely predictable situation 
as far as behavior of the same set of hardware is con- 
cerned. That is, there is virtually no Gaussian jitter pre- 
sent, and a SCSI system could be designed to be quite 
reliable and stable at the maximum UltraSCSI rate. 

Extending the sample period to 5 minutes made no 
djfference in the position of the key features. Using the 
3-second sampling time, the entire data set could be 
acquired automatically in approximately 12 minutes. 

The onset of errors is extremely sharp as the ACK 
position approaches the critical position. One hundred 
picoseconds changes the observation from 0 to 864 
errors near the 8-ns position. On the other end, the 
50.1-ns time produced 7 errors, and the 50.2-ns time 
produced 425 errors. No errors w ere detected at any 
of the times between 50.1 ns and 7.9 ns. This data 
shows that there are no strange effects that prevent 
SCSI from operating at the maximum UltraSCSI rate. 

As the ACK position proceeds into the region of 
more errors, a condition is finally reached in which all 
the transfers have errors. On the one hand, the proba- 
bility that one transfer has the same data content as its 
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neighbor's is very small with this random data pattern. 
On the other hand, since a random data pattern is 
being used, there is a reasonable chance that a bit will 
actually match that transmitted in one state but not in 
the other state. The random data pattern tends to 
spread out the time between the first error and the last 
good transfer. In the limit, for perfectly random data, 
this time is a measure of the total timing imprecision in 
the system. 

This imprecision includes skew in the exciter and 
comparator boards, in die SCSI drivers and receivers, 
and in the cable transmission media (including loads, if 
any), and all forms of jitter. For the test conditions shown 
in Figure 2, the total difference is 3.6 ns near the 5-ns 
point and 5.4 ns near the 52-ns point. This shows that 
the skew specifications in the SCSI standard are over- 
specified as compared to actual hardware performance. 

The data shown in Figure 2 is representative of a 
large variety of configurations up to approximately 3 
meters long and loaded or up to much longer point- 
to-point lengths (20 meters or more [see Figure 6 |). 
The error- free window can be made to collapse by 
adding too many loads or by using the wrong imped- 
ance cable, improper terminators, receivers with the 
wrong threshold voltages, or other bus component 
and configuration parameters. However, the details of 
the actual hardware and configuration do not affect 
the basic conclusion derived from Figure 2, namclv, 
that a great deal of riming margin is available at the 
maximum UltraSCSI rate when ordinary SCSI hard- 
ware is used . 

To put this into perspective, basic gigabit-per-second 
serial transmissions with approximately twice the basic 
bandwidth of UltraSCSI have bit times of about 1 ns 
and timing margins of a few hundred picoseconds. 
UltraSCSI has an effective margin window of a few tens 
of nanoseconds. This represents two orders of magni- 
tude more margin for die parallel SCSI application. 



The initial errors usually originate from the same 
bit. This bit is the one with the most unfavorable tim- 
ing skew with respect to the ACK signal. The cliff is 
not perfectly sharp because there is a 50 percent 
chance that the data transmitted is the same as that 
expected even under the error case and, more impor- 
tantly, because there is some level of jitter present. It is 
this jitter that softens the cliff. Thus, the first errors 
detected happen when the skew of the weakest bit 
adds to the tail of the jitter distribution. Only a few 
errors are present because only a small part of the jitter 
population extends far enough to trigger the error. 
SCSI systems will experience virtually no errors 
because of these mechanisms in service if one operates 
1 ns or more away from an error cliff. 

Note that these results from the special test environ- 
ment almost always yield margins higher than those 
calculated from a set of interoperability specifications. 
This is because the interoperability specifications must 
allow margin for each piece, and the special test envi- 
ronment reports the integrated result from many 
pieces in the complete SCSI connection. 

Higher Speeds The main effect of further increasing 
the transfer rate above the maximum UltraSCSI rate 
in the same set of hardware is to change the time posi- 
tion of the onset of nonzero error rates and to narrow 
the error-free region. Figure 3 shows an example of 
data from Fast-40 transmissions using separate high- 
voltage differential transceivers on each bit. (This data 
was acquired by DIGITAL'S Storage Bus Technical 
Office in 1994.) 

The error-free /one has narrowed to approximately 
15 ns, and the time between first error and 100 per- 
cent errors has widened on both sides, but still no 
uncontrolled regions exist. This strongly suggests 
that at least Fast-40 transfer is possible with no major 
technology changes in the interconnect. 
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Additional Tests Other tests that are useful with the 
special test environment are ground offset effects, ter- 
minator power effects, correlation of time domain 
plots on the signals with error rate distributions, hot- 
plugging testing (which results in good error detec- 
tion), and comparison of the impact of different cables 
and transceivers. Test results of this nature are not 
included in this paper because the impact of these vari- 
ations depends on many parameters and the results 
may not be generally applicable. 

Timing Specification Methodology 

With the increased emphasis on timing precision for 
UltraSCSI technology, it was necessary to introduce 
better specifications for die measurement of timing 
parameters than those in the SCSI-2 and SPI stan- 
dards. Figure 4 shows the precise measurement points 
and features used for the specification of single-ended 
UltraSCSI signals. 

The effects of the finite slew rate on the signal edges 
are accounted for largely by specifying the voltage levels 
that coincide with the receiver input levels. Thus, the 
setup time ends when the receiver is able to detect an 



asserted state at 1.3 V, and die asserted period begins 
when the asserted state has been detected. On the nega- 
tion side, the signal must rise to at least 1.6 V before the 
receiver can detect a negated state, and a negated state 
must be detected if the input signal reaches 1.9 V. In the 
SCSI-2 and SPI standards, any point between 0.8 V and 
2.0 V could be used as the timing measurement. 

Sample UltraSCSI Signals 

Numerous variations on the details of the signals can 
be produced in UltraSCSI configurations. This section 
shows two types of signals as representative examples 
that validate UltraSCSI as viable under certain condi- 
tions. The first case explores a configuration that actu- 
ally exceeds the recommended specifications. This is a 
complex cabled environment with a cluster of loads on 
one end and some distributed loads on the other end. 
The second case shows the signals over a 25 -meter, 
single-ended point-to-point bus. 

Complex Loads Figure 5 specifies a complex con- 
figuration and the single-ended SCSI signals that 
result at various positions along the bus. The logic 
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signal that is driving the SCSI driver chip is the 
first trace at the top; it provides a common timing 
reference for all the signals. The weakest signal is at 
device position 4, just after a relatively long run with 
no loads. This signal is below the 1-V level bur has a 
very slow assertion slew rate that causes considerable 
loss of asserted state pulse width. This complex con- 
figuration works with the receivers used but does 
not have the timing margin required by the Wast-20 
standard. 

By varying the posinon of the loads so that there are 
no loads between the driver and the first load (not 
shown), the signal at the first load device is degraded 
even more rhan at position 4 in Figure 5. This is 
one reason that the overall length of single-ended 
UltraSCSI with manv loads is restricted to l.S meters 



and that the total number of loads is limited to 8.' 
UltraSCSI devices connected to backplanes may be 
especially sensitise to attached cables that extend the 
total bus length more than 6 to 8 centimeters (cm) 
beyond the backplane. This reduced bus length is 
rather sev ere when compared to that allow ed at the 
maximum fast SCSI transfer rate (a total of 3 meters).' 1 
In the section Small, Improved Interconnect, we show 
how to overcome this 1.5-meter, 8-device limit by 
using an active SCSI interconnect. 

Applying the timing measurement methods shown 
in Figure 4 to the waveforms in Figure 5 illustrates 
that more careful timing specification methods do 
indeed help significantly to keep the timing margin 
high enough to use. 
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Point-to-point Configuration If loads arc present 
only at the ends of the bus, the transmission line 
between SCSI devices improves electrically. This 
occurs simply because the loads significantly disrupt 
rhe characteristic impedance and cause reflections and 
attenuation. The point-to-point signal at 25 meters 
has better amplitude and timing margins than signals 
in much shorter buses with closely spaced loads. 
Figure 6 shows a typical example of a point-to-point 
UltraSCSl signal. The format used in Figure 6 is the 
same format used in Figure 5. 

Differential UltraSCSl 

Differential UltraSCSl uses the same configuration rules 
as fast SCSI (25-mctcr total length, 20-cm [8-inch] 
stubs, 16-device load) 1 and uses the same timing values 
as single-ended UltraSCSl. The larger signal amplitudes 
and the common mode rejection property of differen- 
tial transmissions help overcome the transmission line 
weaknesses in heavily loaded and long buses. As with 
any high-voltage differential system the costs — in terms 
of money, power, and space — are higher. 

Other Requirements for UltraSCSl 

The Fast- 20 standard 1 contains a number of detailed 
requirements on the components used in UltraSCSl 
configurations. Included are slight modifications to 
the cable impedance, active negation requirements for 
drivers, special length limits for certain loading condi- 
tions, restrictions regarding the kinds of single-ended 
terminators to use, and timing budgets. 



Summary of Developments in the Area of Increased 
Synchronous Data Phase Speed 

The UltraSCSl (Fast-20) speed increase can be attrib- 
uted to a systematic examination of the margins present 
in actual SCSI hardware and to the elimination of the 
excess margins. Advances in the integrated circuit indus- 
try enabled silicon designs to be specified with tighter 
controls on the driver and receiver timing and threshold 
properties than were possible when the SCSI-2 or SPI 
standards were developed. All the important changes 
needed for SCSI devices are contained in the silicon 
designs for the drivers and receivers. As a result, die user 
sees no difference between the appearance of UltraSCSl 
and that of ordinary SCSI. 

The system integrator must use a more restrictive 
set of configuration rules than required for fast and 
slow SCSI. Also, the only impact on software is the 
addition of a new speed agreement code for the rates 
uniquely supported by UltraSCSl. This negotiation 
is done precisely the same way for UltraSCSl as for 
any other form of SCSI. Finally, UltraSCSl devices 
are 100 percent backward compatible with fast and 
slow devices. Although a device may be capable of 
the maximum UltraSCSl rare, it may be needed in a 
configuration that does not support UltraSCSl. In 
such a case, the UltraSCSl device would be used in 
the fast or slow mode and would have more margin 
at those slower speeds than it would if it were not 
UltraSCSl capable. 
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Bus Expanders 

As noted previously in the discussion of complex loads, 
there are rather severe limits on the configurations that 
can be achieved with single-ended UltraSCSI when 
implemented in a single bus. The extension to parallel 
SCSI architecture that overcomes this constraint 
involves using active circuits that connect SCSI buses 
electrically but isolating them from each other in a 
transmission line sense. These circuits have the general 
name expanders, since they expand the configuration 
capabilities of parallel SCSI. 

Each individual bus has two terminators and its own 
transmission mode (single ended or differential) and 
obeys transmission line-based configuration rules as if 
it were the only bus in the system. When used with 
expanders, these individual buses are called bus seg- 
ments. The collection of SCSI devices in all the bus 
segments that are electrically connected together is 
called the SCSI domain. One example of a SCSI 
domain using expanders is shown in Figure 7. Note 
that w hen using expanders, it is possible to have bus 
segments that do not have any SCSI initiators or tar- 
gets but only serve to form an electrical interconnect 
between other bus segments. 

Expander Properties 

Expanders are available in two basic types: simple and 
bridging. Bridging expanders behave as a SCSI initia- 
tor or target, w hereas simple expanders have a set of 
properties that make them look like a piece of wire 
with delay to the protocol. Simple expanders 

■ Cannot initiate SCSI IDs and arbitrations and can- 
not originate messages, although the expanders can 
read messages sent from initiators and targets 

■ Allow minimal arbitration propagation delay 

■ Yield a retransmitted signal timing skew (both 
delay and high/low) no w orse than from valid SCSI 
initiators or targets 

■ Do not interfere with the KEQ/ACK offset count 

■ Allow m in/max pulse widths to be maintained 

■ Require the filtering of the SCSI RESET line 

■ Allow arbitrary placement of the initiator and the 
targets 



■ Require that terminator power not be connected 
betw een the segments being coupled 

■ Do not need to know the negotiated data phase 
speed or any other variable property of a transaction 

■ Require that there be no electrical or logical connec- 
tion of the DIFFSENS line (a single-ended signal 
that indicates the transmission mode being used on 
the bus segment) between segments being coupled 

■ Issue a SCSI bus RESET signal on one segment on 
detecting transmission mode (single-ended/LVD, 
etc. ) changes on the other segment 

Simple expanders are becoming available from several 
sources in the industrv for use with UltraSCSI. 

Domain Rules Using Simple Expanders 

When using only simple expanders in a domain, six 
rules must be observed: 

1 . All bus segments in the domain must comply with 
their individual bus segment length limits and other 
segme nt-rclatcd req ui re ments. 

2. Any segment between two other segments must 
support the highest performance level that can be 
negotiated between the two other segments. For 
example, two wide UltraSCSI segments must not 
be separated by a segment that does not support 
both wide SCSI and UltraSCSI. 

3. The maximum propagation delay between any 
two devices in the domain cannot exceed 400 ns. 
A special case exists for devices that use extremely 
Jong times for responding to BUS FREE (the 
so-called BUS SET DELAY) — the one-way propa- 
gation limit is 300 ns instead of 400 ns. 

4. The number of addressable devices cannot exceed 
16 unless the domain contains bridging expanders. 

5 . A branch/leaf architecture must be observed; loops 
are not allowed. 

6. The REQ/ACK offset negotiated between any 
two devices must be large enough to ensure that 
adequate offset and buffering is available to accom- 
modate the round- trip time between the devices. 
For the maximum UltraSCSI rate with a 400 ns 
maximum one-way domain propagation time, the 
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minimum offset is 18. (This offset level is derived 
by considering a maximum round-trip time of 800 
ns at 50 ns per transfer [800/50 = 16] and some- 
what arbitrarily adding two transfers to account for 
some additional delay due to the processing time in 
the silicon.) 

Achieving the 400-ns one-way domain delay 
requires expanders that will not pass the wired-OR 
glitch (noted earlier in the introduction) between bus 
segments. This filtering of the glitch allows the bus 
segments to settle individually. 

The propagation delay through an expander 
directly subtracts from the physical distance between 
devices. It is therefore desirable to use expanders 
with small delays. For a singJe-ended-to-single-ended 
application, the delay can be as low as 10 ns. For a 
single-ended-to-differential application, the delay is 
typically around 100 ns, which is another significant 
penalty to using differentia] bus segments. 

More detail concerning these rules and other prop- 
erties is available in the draft ANSI document: SCSI 
Enhanced Parallel Interface* which was edited by the 
author of this paper. 

Summary of Improvements Related to Bus Expanders 

The use of simple expanders dramaticalJy extends the 
utility of single-ended UltraSCSI. The most obvious 
example is the ability to introduce point-to-point 
segments where additional length is needed . A less obvi- 
ous example is the ability to create star or hub configu- 
rations by clustering simple expanders into a local 
physical area. An example of a three-port SCSI hub 
is shown in Figure 7. Note the three simple expander 
circuits internally connected within the hub. Simple 
expanders also make it possible to mix single-ended and 
differential SCSI devices in the same domain, to achieve 
the fill I 16-device count, to add and remove bus seg- 
ments without shutting down the entire domain, and to 
achieve differential performance without incurring the 
extra cost of differential. Bridging expanders offer the 
same transmission isolation as simple expanders and 
may allow increasing the number of devices in the 
domain to as high as 946, 5 but bridging expanders are 
not as well developed as simple expanders and will not 
be explored in depth in this paper. 

Note that the improvement in signal integrity is dra- 
matic when using expanders with backplane applica- 
tions. Therefore, it is good practice to use an expander 
whenever connecting a SCSI cable to a backplane that 
contains SCSI devices. 

Smaller, Improved Interconnect 

Another recent development in parallel SCSI technol- 
ogy is the introduction of much smaller external phys- 
ical interconnects and more capable internal device 
interconnects. The SCSI connectors and shielded 



cables have historically been large, bulky, and generally 
difficult to manage. 

Spearheaded by activities that began in 1995 in the 
SFF (formerly Small Form Factor) industry group, 
standardization is under way of two new connector 
families that offer unprecedented levels of functionality 
and true multisourcing of complete connectors for 
parallel SCSI. These families are the Very High Density 
Cabled Interconnect (VHDCI) 2 shielded connectors 
that reduce the overall size of an external connector by 
tAvo thirds and the Single Connector Attachment-2 
(SCA-2) 7 unshielded connectors that integrate into 
a single connector all the functions needed to run a 
peripheral. The VHDCI family revolutionizes the 
external SCSI interconnect and the controller parts of 
the internal SCSI interconnect; the SCA-2 family does 
the same for the internal device interface. 

For the first time, complete connectors — not just 
the mating interface — are being standardized. This 
feature is essential to achieving interchangeability and 
second sourcing for connectors with the same style of 
termination-side contact. The VHDCI family is speci- 
fied in 26 different forms, all with exactly the same 
mating interface, so that virtually any kind of device 
or cable assembly design can be accommodated. 
Interestingly, this array of choices for the connectors 
does not increase the complexity of the interconnect 
but rather opens up new ways for product developers 
to design products while maintaining a simple and 
physically interoperable separable connector interface. 
In fact, this ability to accommodate a variety of prod- 
uct design requirements without changing the separa- 
ble interface is one reason that SCSI is becoming less 
complicated. 

Similarly, the family of SCA-2 connectors for SCSI 
internal devices and cables is following the VHDCI 
standardization model, with a significant number of 
intermatable forms being standardized. These connec- 
tors offer the ability to bring all the SCSI signals, all the 
power and ground connections, and all the optional 
signals, such as IDs, spindle sync, and power fail, out of 
the device through a single unshielded connector. This 
feature dramatically shrinks the cost and complexity of 
interconnecting an array of SCSI devices. 

Using an SCA-2 connector, the device may be 
inserted into a backplane without using cables. If the 
SCA-2 and backplane combination is not used, a SCSI 
cable (50-pin or 68-pin conductor), a four-lead power 
cable for ground and power (5-V and 12-V), and one 
or more smaller cables for the IDs etc., are required for 
every device in the system. Each of these cables is 
routed differently, has different current carrying and 
other electrical requirements, and has very different 
connectors. Although this cabled option is flexible and 
offers significant advantages in some systems, it is usu- 
ally not the best solution in the device array and mod- 
ular packaging applications that are required for the 
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higher-end applications. Therefore, the SCA-2 is a sig- 
nificant factor in the dramatic reduction in complexity 
of higher-end SCSI device applications. 

VHDG Connectors 

The physical size of the VHDCI connectors is much 
smaller than the earlier versions, as seen in Figure 8. 
Because of its low profile, the VHOCI 68-pin family is 
approximately half the height and twice the width of the 
latest Fibre Channel external connector, the High- 
Speed Serial Data Connector (HSSDC). Figure 9 shows 
a comparison of the VHDCI and HSSDC connectors. 
The same panel space is required for either technology. 

The VHDC1 connectors shown in Figure 9 are 
closely spaced, but the orientation of the polarizing 
shield connection is 180 degrees different between the 
upper and lower connectors. This arrangement allows 
an offset cable assembly to be used where one side is Hat. 
This same cable assembly may be used on both the 



upper and lower connectors without interference. The 
specifications of the VHDCI interface ensure that 
neighboring PC option slots will not have interference 
even if all the SCSI ports have cable assemblies attached. 

The VHDCI connector is useful for multiport appli- 
cations such as RAID (redundant array of inexpensive 
disks) controllers. Figure 1 0 shows examples in which 
the wide version of the connector family has allowed at 
least a doubling of the number of ports possible in a 
single controller form factor. As illustrated in Figure 
10, the device design enables up to four wide SCSI 
ports on a single PC option card cutout. 

The VHDCI retention scheme is also significantly 
simplified by introducing a three-way retention post 
for the bulkhead connector. This post accepts ( 1 ) the 
conventional jackscrews, (2) a squeeze-to-release clip 
for positive retention with rapid release, or (3) a detent 
ring retention that requires a stronger pull than that 
required with no retention but no action other than 
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Four Wide SCSI Ports on a Single PC Option Card 



pulling or pushing. The choice of retention type is 
made in the cable assembly. All 68-pin VHOCI cable 
assemblies that comply with the SFF specifications 
work on all 68-pin VHDCI mating connectors. 

Figure 11 shows the details of rile 68-pin VHDCI 
system. The lip in the jack post provides the securing 
point for squeeze-to-release clips and for split-ring 
detent retention. The center of the jack post is threaded 
for use with jackscrews. 

Although smaller than the high-density connector, 
the VHOCI connector is durable. It has no pins that 
can bend; its retention scheme uses the same-size 
jackscrcw thread as the high-density wide connector; 




and its contacts are imbedded in the housing where 
they cannot move or become distorted. 

SCA-2 Connectors 

The SCA family uses an 80-position, leaf-style contact 
to interface all active SCSI lines, three power voltages, 
and device control signals. This connector is consider- 
ably smaller than the collection of the three different 
connectors used for power, options, and SCSI bus in a 
cabled system. There are two basic versions of SCA 
connectors: SCA-1 and SCA-2. Both versions are 
unshielded and useful only within shielded enclosures. 
The SCA-1 has 80 positions with all contacts designed 
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Overall View of the 68-pin VH DCI System 
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ro be the same length. The SCA-2 can be mated to the 
SCA-1 but lias advanced grounding contacts and 
sequenced signal and power contacts for supporting 
hot plugging and blind mating (no visual feedback 
during mating). Both versions are available in manv 
styles, which differ by the termination-side structure 
and overall orientations. 

The SCA- 1 is not a documented standard and is 
being replaced by the SCA-2. The SCA-2 connector 
was introduced to SFF in 1995 7 as the first step toward 
formal standardization. 

Two special features exist in the SCA-2 connector. 
First, two contacts, one on each side of the connector, 
provide the first make/last break for the ground con- 
nection. This design ensures that a common electrical 
ground is established between the device and the sys- 
tem before any power or signal connections are made 
on device insertion. Upon removal, these contacts 
ensure that the ground stays intact throughout the 
disengagement of the signal and power pins. 

The second feature allows the special long power 
contacts to precharge bypass capacitors before: the 
main power contacts make. This reduces the distur- 
bance to the power distribution system and eliminates 
any arcing on the service power pins. Two pins at the 
extreme ends of the connector indicate that the con- 
nector is fully mated. The overall view of the SCA-2 
system is shown in Figure 1 2. 

The size of the connectors in the SCA family has not 
decreased dramatically. The connectors need to main- 
tain enough size to achieve blind mating alignment, 
and, for backplane applications, there is little advan- 
tage in having a connector that is smaller than the 
device. With 89-mm (3.5-inch [inj) or the newly pro- 
posed 76- mm (3-in) form factor devices, the SCA con- 
nector comfortabiv fits within the device boundaries. 



The use of backplanes for direct device attachment 
is possible because all the electrical connections for the 
device are available in one connector on the device. 
This design eliminates the cables used to attach the 
device and the space required for the connectors, thus 
significantly shrinking the size required to package 
multiple devices. 

External SCSI Cable 

The external cable for SCSI is shrinking also, through 
the use of smaller-gauge wire, better dielectrics, and 
Jess jacketing material, as illustrated in Figure 13. 
Formerly, wide SCSI required a cable of approximate! v 
12.70 mm (0.50 in) in diameter (a 126.677-mm 2 
[ 0 . 1 9 6 - i n 2 J cross section) with 28-gauge wire. Today, 
wide SCSI cables with 30-gauge wire are shipping 
with diameters of 9.40 mm (0.37 in) (69.398-mm 2 
[0.107-iir| cross sections). Cables with 7.62-mm 
(0.30-in) diameters (45.61-mm 2 [0.07-iir| cross 
sections) are possible with 32 -gauge wire and inexpen- 
sive dielectrics for wide SCSI. Cables with 6.35-mm 
(0.25-in) diameters (4.987-mnr [0.049-in 2 ] cross sec- 
tions) for narrow SCSI (45.61-mm 2 [0.07-in 2 J cross 
sections) are flexible and manageable — similar in size 
and flexibility to a desktop computer power cord and 
smaller than many serial cables. When used with active 
single-ended, LVD, or H VD terminators, the 32-gauge 
wire is adequate for distributing terminator power and 
SCSI signals in most applications. Long cables should 
not be used for terminator power distribution. 

Further reductions in the connector and cable sizes 
need to be weighed against the ease of handling, the 
need for sufficient strength to survive normal service 
stresses, and the cost increases at very small sizes. The 
combination of the VHDCI connector and 30/32- 
gauge wire sizes is a good optimization. 
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Overall View of the SCA-2 Connector System 



Digital Technical Journal 



Vol.9 No. 3 1997 




30/32-GAUGE NARROW 



Figure 13 

External SCSI Cable Diameters 

Summary of the Benefits Derived from a Smaller, 
Improved Interconnect 

The VHDCI connector and smaller cables combine 
to offer a robust yet user-friendly revolution in SCSI 
interconnect. The leaf-style contact of the SCA con- 
nector eliminates problems with bent pins that fre- 
quently bedevil the older wide SCSI connector. 
The ability to use up to four wide UltraSCSI ports in 
a single PCI option slot increases the SCSI connec- 
tivity per PCI slot to 60 devices (from 15 devices). 
By using multiple PCI slots, hundreds of SCSI devices 
can be connected to a single PC or workstation. 
In addition, the SCA-2 connector implements the 
essential contact sequencing required to perform SCSI 
device hot plugging. 

Device Insertion and Removal Bus Transients 

The multidrop feature of the SCSI bus allows device 
removal and replacement without disturbing the commu- 
nications between other SCSI devices, if the electricaJ dis- 
turbances caused by the device being added or removed 
are not detected by any other SCSI devices. Thus, it is 
architecturally possible to dynamically reconfigure the 
device population without interrupting existing data 
transmission processes between operational dewces. 

The transients involved with device insertion and 
removal include mechanical vibrations, power distrib- 
ution instabilities, SCSI terminator power noise, elec- 
trostatic discharge (radiation and induced current), 
and SCSI signal line noise. All except the SCSI signal 
line noise and the terminator power noise are handled 
by the storage system design and therefore are not 
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directly part of the advancements in parallel SCSI. 
The SCSI terminator power noise is determined by 
the size of the decoupling used on the SCSI termina- 
tors and the size of the capacitance on the device 
being inserted. This noise is easily controlled by 
ensuring that these sizes meet the values specified in 
theSPI standard. 4 

The delicate case is when the SCSI signal lines arc 
involved, which is the subject of this section. To deter- 
mine the nature and magnitude of these signal line dis- 
turbances, one must understand the following three 
mechanisms: (1) the overall sequence of events, (2) 
the electrical dynamics of connector contacts when 
used in the SCSI application, and (3) the electrical 
consequences on the bus when the device makes/ 
breaks contact with the SCSI signal line. 

There are two sequences of interest: insertion and 
removal. The removal process is easy to grasp after the 
insertion process is understood. 

Single-ended Device Insertion 

The initial conditions considered for SCSI device 
insertion assume a SCSI device with its ground solidly 
and continuously connected to the ground of the 
SCSI bus. This connection is easily accomplished, for 
example, by using sequenced contacts where the 
device ground makes connection well before any 
signal connection. In this state, the SCSI device pins 
present a maximum fully discharged capacitance of 
approximately 25 picofarads (pF). After the device 
signal pin contacts the bus, this capacitance becomes 
charged (by extracting charge from the bus) to the 
voltage on the signal line at the time of the insertion. 
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These values range from approximately 3 V for 
negated lines to nearly 0 V for asserted lines. 

Since the SCSI device being inserted is logically off 
(i.e., there is no driver current), the only current that 
needs to flow is that required to charge the 25-pF 
capacitance. This is sharply different from many con- 
nections in electronics in which current flows 
through the contact after an electrical contact has 
been established. 

In the case where no bus voltage changes occur 
except as a result of the device insertion, the insertion 
transient begins with the initial contact and ends when 
there is no further bus voltage change with time (the 
steady state voltage), •ncc the device pin voltage 
reaches the steady state bus voltage, no further current 
flows through the contact. 

Therefore, once the device capacitance becomes 
charged to the steady state signal line voltage, no further 
disturbances to the signal line voltage will occur even if 
the contact opens momentarily during a chattering 
event. The voltage on the device capacitance changes 
during the transient from a discharged state (zero volt- 
age) to the steady state signal line voltage, with the cur- 
rent always flowing into the device capacitance. 

If the signal line voltage changes after the insertion 
transient is completed (because of events such as being 
driven by other devices, by noise, or by the inserting 
device beginning to use its own driver), then current 
will again begin to flow through the contact. This is a 
normal SCSI condition for contacts in service. If the 
signal line voltage changes during the insertion tran- 
sient because of events other than the connector con- 
tact effects (e.g., signals changing because of being 
driven by other devices, other noise), then it is more 
difficult to determine exactly where the insertion tran- 
sient ends. The beginning of the insertion transient 
will still be marked by a charging of the device capaci- 
tance. Examples presented later in this paper show 
both insertion events and driv ing events from other 
devices occurring at the same tiine. 

The time required for complete contact mating on 
all SCSI signals in the bus is up to six orders of magni- 
tude greater than the time required for a SCSI signal to 
change state. Therefore, signal level changes are likely 
during the insertion process. The electrical behavior of 
the contact as it continues wiping (sliding after initial 
contact is made) from its initial contact point to its 
final resting position becomes a critical part of the 
process. The following subsections explore this behav- 
ior in detail. 

Connector Insertion Dynamics The data presented in 
this section were derived from a DIGITAL DSSI bus in 
1990. The DSSI bus is nearly identical to the SCSI bus, 
and many of the results apply without modification 
to SCSI. Similar data have been observed on the SCSI 
bus, but the complete set of data presented in this 



paper is not presently available from actual SCSI hard- 
ware. The disturbances in the DSSI bus are larger than 
those seen in the SCSI bus, because the DSSI voltages 
are slightly higher (3.5 V for DSSI compared to 2.8 V 
for single-ended SCSI), and the instrumentation 
capacitance (-10 pF) adds significantly to the device 
capacitance because of the state #f the art for scope 
probing in 1990. Numerous tests with modern scope 
probes (0.6 pF or less) of SCSI hardware have shown 
that the SCSI disturbances are indeed qualitatively the 
same but significantly less in size than those shown 
here from the DSSI hardware. 

The mechanisms described apply to any system in 
which the insertion transient is caused by the charging 
of a small capacitance. Figure 14 shows the basic test 
setup. A device is inserted into a connector with scope 
probes attached on either side of the mating interface 
and with an additional probe attached to the bus some 
distance from the connector. The voltage on the 
device side of the connector is used as the trigger 
signal into a digital storage scope so that the events 
before, during, and after the mating event can be 
examined. This is clearly a single-event type of mea- 
surement, so a high sampling rate (1 billion samples 
per second) and significant scope memory is required 
to capture the wav eforms. The scope probes used have 
a 1 -megohm input resistance. 

The connector used for the tests in this section has 
multiple parallel pins that all mate and demate in the 
same general time period. There is no intentional 
difference m the pin lengths. The time relationship 
between the mating events on two neighboring pins 
was explored first. By choosing neighboring pins, the 
differences between the pins is kept to a minimum 
so the time differences observed should represent the 
best pin-to-pin synchronization in a mating event. 

For this test, a probe was attached to each of two 
pins, and the connector alone (not part of a device) 
was mated to the bus segment connector. Figure 15 
shows the results. 

Both pins appear to show instantaneous transitions 
between the charged and discharged states on the time 
scale that was required to capture both events on the 
same plot. The mating events are separated by approx- 
imately 19 milliseconds, and there is no evidence of 
any discharging after the initial charging has occurred. 
Since the scope probes have a 1 -megohm input resis- 
tance, any lack of contact during the wipe portion of 
the mating will allow the capacitance to discharge 
through the probe with a time constant of approxi- 
mately UC. where R is the scope probe resistance and 
C7 is the sum ©f the connector pin and probe capaci- 
tances. Assuming a total of 10 pF, this gives a decay 
time constant of 10 microseconds. 

Figure 16 shows another mating event on pin 1 at 
a 500 times more sensitive time scale. In this case, 
some evidence of momentary opens is seen with the 
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Figure 15 

Time Relationship between Mating Events for Two Pins in the Same Connector 
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Figure 16 

Contact Bounce Events 

expected decay dynamics. The actual time constant is a 
bit longer than 10 microseconds because of some 
capacitance in the connector pin. This bounce behav- 
ior may or may not be present during the initial stages 
of the event shown in Figure 15, but clearly the behav- 
ior is not visible in the figure. To observe the suite of 
transients that exist in the mating process, one must 
examine the transients at several different time scales. 
In general, this requires repeating the mating events, 
since the dynamic range of the scopes used was insuffi- 
cient to capture all the detail in a single event. 



The initial mating event on pin 1 still appears to be 
instantaneous on the time scale used in Figure 16, but 
some slope is visible in the second bounce event. Also, 
during the second decay period, a shelf in the decay 
indicates that a partial, high-resistance contact was 
briefly experienced. Pin 2 is not close to making a con- 
tact at the time range shown in Figure 16. The figure 
shows a small amount of cross talk in the pin 2 voltage 
waveform caused by the pin 1 transients. 

This data clearly shows that the details of the mat- 
ing process are highly complicated and intrinsically 
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unpredictable. Therefore, the best we can hope for is 
to establish some limiting eases for the important 
parameters. The limiting features shown in Figure 16 
arc the extremely rapid initial mating event and the 
decay times. We examine these rapid transients in 
detail later in this section. The decav times are deter- 
mined by the actual contact resistance and the resis- 
tance of the leakage path to local ground. For normal 
SCSI devices, there is very little leakage to ground on 
the device pin so the opens produced by the bounce 
have no effect. 

Some cases observed indicate much more complex 
bounce structures. Figure 1 7 shows a case in which the 
mating connection is not established until more than 
700 microseconds have passed. 

The data in Figures 15 through 17 were all acquired 
from the same connector contact during separate mating 
processes. Typically, the details of the mating event arc 
very diff erent even under nominally identical conditions. 

Another type of mating event is shown in Figure 18. 
This event requires approximately 10 microseconds to 
make the transition from uncharged to charged, and 
there is no bounce. This particular event produces 



almost no cross talk into pin 2. Events with these char- 
acteristics are somewhat rare and are called gradual 
transients in this paper. 

Figure 19 shows a closer look at the rapid transient 
type of mating event. In this Hgure, we have added a 
device capacitance of approximately 20 pF to the scope 
probe for a total of approximately 30 pF. Notice that 
the transient requires 2 to 3 ns to substantially com- 
plete its charging. There is a ratio of nearly 10 7 between 
the mating events on different pins in the same connec- 
tor and the rapid transient of a single contact. 

Limiting Parameters for the Rapid Transient The 

question of whether the rapid transient shown in 
Figure 19 is the worst case needs to be explored 
because the duration of the transient affects the distur- 
bance on the bus. Some bounding features and some 
implications of the observed behavior of the rapid 
transient arc n#rcd in this subsection. 

Assuming that the transient event occurs in 2 ns and 
that the velocity of impingement just prior to the first 
mating event is 1 meter per second, then the distance 
traveled by the contact would be 2 nanometers (nm). 
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Extended Mating Bounce Events 
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Figure 18 

Gradual Mating Event, No Bounce 
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Figure 19 

Derailed Structure of the Rapid Transient 



This distance is equivalent to a few atomic distances. 
The distance traveled during the gradual transient 
shown in Figure 18 would be approximately 10 
microns, and during the extended bouncy case shown 
in Figure 17, approximately 1 mm. The velocity for the 
latter two cases would lilcely be somewhat reduced 
because of the mechanical interference between 
the pins, and the actual distance traveled is probably 
significantly less. There is little opportunity, however, 
for the velocity to be reduced for the rapid transient, 
and this distance of 2 nm is probably at least the 
correct order of magnitude. 

The following calculation shows the total current 
levels required to charge the capacitance in 2 ns. 

0= GT=30x 10" ,2 pFx3.5 V 
= 10.5 x 10" coulombs, 

where Q represents the total charge, C is the capaci- 
tance, and V is the voltage. Since this charge is trans- 
ferred in a time t of 2 ns, the average current is 

Q/(= 10.5 x 10 N coulombs/(2x 10" ns) 
= 52.5 miljiamperes (mA). 

For a gradual transient that takes 10 microseconds, 
the average current is approximately 10 microamps. 
These calculations show that the most severe ampli- 
tude disruption to the signal on the bus occurs with 
the rapid transients, since relatively large current must 
be supplied in a short time to charge the capacitor. 

The next item to be examined is the current density 
that must exist during the transient. Since the contacts 
move only 2 nm and the surface finish of actual con- 
tacts is not nearly this smooth, it is reasonable to 
assume a square 2-nm contact. Clearly, this assump- 
tion is not rigorously defensible and could be the sub- 
ject of an entire study area in its own right; however, 
there is no basis for assuming that the lateral contact 
region would be any different than the contact area in 
the mating direction. The basic conclusions would not 
be affected even if we assumed a hundredfold lateral 



increase in contact area. Attempts to use scanning 
electron microscopy to examine the actual contact 
area were not fruitful in establishing the actual initial 
physical contact area because of the severe physical dis- 
ruption that occurs on the microscopic level and 
because of the small sizes involved. 

Under these assumptions, the physical contact area 
is assumed to be (2 nm) 2 or 4 x 10 14 cm 2 in the follow- 
ing calculations. The current density to support the 
50-mA rapid transient current is therefore approxi- 
mately 10 12 A/cm 2 . Typical current densities in copper 
and other metals are less than 10 6 A/cm 2 . The electro- 
migration onset current is of this same order. The cur- 
rent density in the rapid SCSI transient is a million 
times greater than that which metal can support. 

To support the massive current density, the actual 
contact area must be much larger than the initial phys- 
ical contact area assumed in the above calculations. 
The author believes that this can be explained by a 
micromolten metal-to-metal joint that is formed upon 
initiaJ contact and that the front of the melt propagates 
(probably through phonon interaction) at approxi- 
mately the speed of sound in the metal. This process 
would create crudely a thousandfold increase in the 
effective insertion velocity and would result in a mil- 
lionfold increase in contact area, since the melt would 
propagate in all directions. 

This mechanism would produce reasonable current 
densities and would form an intimate metal -to- metal 
interface with both contacts that would aid in reduc- 
ing the contact resistance. The micromelt size becomes 
rapidly self-limiting, with the expanding contact area 
causing decreased current density, which in turn, causes 
decreased melt temperature. 

As discussed in the next section, the actual contact 
resistance during the rapid transient cannot be large. 
If this resistance is large, as in the case of the gradual 
transient, the mating event is much less disruptive. 

Many variations on the mating transients can be 
observed, but we do not attempt to show all of them 
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in this paper. One special variation, however, is worth 
noting — the combination of rapid and gradual tran- 
sients in the same mating process. Sometimes the mat- 
ing process starts with a gradual transient and then 
shifts to a rapid transient. Figure 20 shows a complex 
mating process in which (1) a gradual transient initi- 
ates, (2) a rapid transient starts but does not complete, 
(3) the rapid transient ends, (4) another gradual tran- 
sient process starts, and (5) another rapid transient 
finishes the charging process. 

This observation is consistent with several possible 
microprocesses during which the initial rapid transient 
extinguishes before completion. 

■ The micromelt becomes phvsicallv torn apart by 
the advancing motion of the contacts. (This process 
is unlikely because of the e.vcessixclv slow physical 
morion.) 

■ The micromelt explodes. (This process is likely.) 

■ The micromelt becomes resistive through the cont- 
amination of the melt with insulating material. 

■ The micromelt front reaches a thin region and 
opens because of the lack of material. 

■ The micromelt front reaches an insulating region. 

On further movement of the contacts, a new rapid 
transient condition is encountered between different 
metallic peaks of the contacts, and a new rapid tran- 
sient begins. Figure 21 shows a conceptual representa- 
tion of this process. 

Gradual transients appear to be associated with nor- 
mal current densities (i.e., 10 6 A/cm') and much 
higher contact resistance than rapid transients. In cases 
where a micromelt is not initiated, the low contact 
resistance associated with the liquid metal-to-solid 
metal interface and the expanded contact area are not 
present. Therefore, one way to eliminate the mating 
disturbance caused by the rapid transients is to ensure 
that a micromelting process is not possible. 



In the process shown in Figure 20, it is probable that 
a gradual-type contact is being maintained somewhere 
else in the contact, since no voltage decay is evident 
when the rapid transient ends. Indeed, it is to be 
expected that the rapid transient mechanism would 
not operate after the capacitance is charged to a certain 
level, since there would not be enough energy differ- 
ence to initiate and sustain a rapid transient. Therefore, 
the gradual transient is the behavior derived from an 
extrapolation of the normal mechanisms that produce 
contact resistance. This detailed discussion is pursued 
because we must understand the basic physical mecha- 
nisms to gain confidence that we are considering the 
worst-case disturbances. 

Single-ended Device Removal 

During the process of removal, the device pin separates 
from the bus. Since both the bus and the device are at 
the same voltage just before the separation, no current 
is flowing unless the bus voltage changes when the con- 
tact is in the process of separating. Therefore, in most 
cases the separation process causes no disturbance. 

Bounce can occasionally be observed during the 
demating process when there is a leakage-to-ground 
path present on the device side. Of course, if a voltage 
decay occurs and the contacts re-connect, the mecha- 
nisms are essentially the same as for the insertion tran- 
sient. The kev point is that no additional mechanisms 
have been noted for device removal that could be 
more disruptive than those operating during the i nser- 
tion process. In the limit, the removal process could 
produce as much disruption as the insertion process. 

Figure 22 shows two examples of demating. The 
demating events shown in Figure 22a have only approxi- 
mately a 60-microsecond separation. This separation is 
exceptionally small, and it is theoretically possible to have 
coincidental contact-to-contact events (within the preci- 
sion of the instrumentation). The demating event with 
bounce shown in Figure 22b was acquired on exactly the 
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Figure 21 

Architecture of Combination Gradual/Rapid Mating Event 

same pins in exactly the same connector used for the 
events in the top of the figure, and there is no evidence of 
any activity on pin 2. Pin 2 demated long before any 
activity was seen on pin 1. Again, this underscores the 
unpredictability of the detaiJs of any given event. 

Impact of Device Insertion and Removal on Bus Signals 

This section contains several examples of the noise 
produced on the bus side of the connector. Actual 
devices with approximately 25 pF of capacitance were 
used to obtain the data. This capacitance value is 
increased by the probe capacitance. On the bus side, 
there is also some increased capacitance caused by the 
probe used to acquire the bus side signal. Figure 23 
shows the basic impact of a rapid transient on the bus 
side of the connector and the time relationship of the 



bus disturbance to the voltage on the device side. The 
bus voltage is reduced while it supplies the necessary 
charge to the device pin. After the device capacitance is 
charged, the bus resumes its voltage level before the 
insertion transient (more or less). 

In this test, the bus pulse is approximately 3-ns wide 
at its midpoint; its peak amplitude is approximately 
1.25 V. This pulse is significantly larger in amplitude 
than that produced from a device alone. 

One of the more interesting features of the signals 
in Figure 23 is the lack of commonality or tracking in 
the signals after the rapid transient has passed. In the 
simplest interpretation, one would expect both sides 
of the connector to have nearly the same voltage 
(at the least to be within the accuracy of the 0.1 -ns 
propagation time between the probes). The following 
discussion addresses the author's current thinking on 
the reasons for this lack of tracking. 

Instrumentation effects, such as resonance or differ- 
ences in probe properties, were ruled out by using 
both probes on the same signal and noting that there 
was little difference in the signals reported from each 
channel. Later, typically after a few microseconds, the 
voltages do become effectively the same. 

Because a significant voltage difference is present for 
relatively long times, there must be a significant voltage 
source between the contacts to support this observed 
difference. In the initial stages, the difference between 
the pin voltages is approximately 3 V. If the current is 
the one calculated in the section Limiting Parameters 
for the Rapid Transient, that is, approximately 50 mA, 
then the current-limiting impedance must be at least 
3/0.05 = 60 ohms. This impedance, coupled with the 
parasitic capacitances and inductances, serves to blunt 
the instantaneous electrical energy transfers that would 
be implied by a very low source impedance. If the 
source impedance were very low, then both sides 
would have to track shortly a tier the i nitial contact. 

Part of this limiting impedance is the loaded or local 
transmission line impedance of the bus. The characteris- 
tic impedance is nominally approximately 100 ohms for 
an unloaded bus. Since die bus connector is attached to 
the middle of the line, both sides are available to supply 
charge and the effective charging impedance would 
be approximately 50 ohms. A 30-pF capacitance would 
have a charging time constant of 1.5 ns. This time con- 
stant fits the observations well during the rapid transient 
itself but does not (it the timing parameters of the vol t- 
age differences observed well after the rapid transient. 

Elevated local temperatures are almost certainly pre- 
sent during the rapid transient (near the melting point 
of die metals!), so it seems plausible that the mystery 
voltage source is basically thermal electromotive force 
(EMF) between the pins. Allowing a few microseconds 
to achieve thermal equilibrium and subsequent loss 
of the thermal EMF aJso seems quite plausible. These 
details are inviting further detailed investigation but 
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Most Severe Noise Pulse Observed 



do not affect die practical conclusions as applied to 
parallel SCSI. 

As added evidence for thermal effects, experiments 
with early LVD SCSI devices that use a 1 .2-V bus level 
instead of the 3.5- V bus level shown in Figure 23 
transfer much less energy and have a much shorter 
settling time before both sides of the contact track. 
These LVD results will be reported separately. 



The point extracted from these charging-impedance 
and settling-time observations is simply that the over- 
all energy transfer rate is limited by the microphysics 
of the process. This means that Figure 23 almost cer- 
tainly illustrates the worst-case disturbances. 

It has been noted thatthc bus pulse is similar to that pro- 
duced by a stub on the bus and a signal with a fast rise/fall 
time. In a sense, we really are charging a stub in either case, 
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and in both cases the loaded or loca] characteiistic imped- 
ance of die bus limits the extent of the disturbance. 

To more accurately measure the noise pulse pro- 
duced when a device is added to the bus, measurements 
were performed without a scope probe attached to the 
device pin. To do so required triggering the scope from 
the noise pulse on the bus side. Consequently, it was 
not possible to sec the device-side charging dynamics. 
Figure 24 shows the measured pulse near the device 
connector and at a point 2 meters away. 

The pulse measured in Figure 24 has approximately 
half the amplitude of the pulse in Figure 23. This is 
more reduction in amplitude than one would expect 
from the removal of 10 pF from the effective device 
capacitance, and this difference, while not completely 
explained, is in the Favorable direction. The noise pulse 
that reached the next device (where it could be 
detected as an error) would be even smaller, because of 
the dispersion and attenuation in the bus and because 
the neighboring device would need to have its 25-pF 
capacitance charged also. The signal at the measure- 
ment point 2 meters away in Figure 24 indicates 
the intensity of the attenuation and dispersion to be 
expected in the rapid transient bus pulses. The details 



of the attenuation and dispersion depend somewhat 
on the bus media used. 

The rapid transient bus pulses arc shown on actual 
data pulses in Figure 25. The top trace in the figure 
shows a rapid transient pulse on a negated part of a 
single-ended SCSI signal. There is a scope probe on this 
device, but the device capacitance is only approximately 
1 5 pF so the totaJ with the probe is approximately 25 pF. 
Note that the noise pulse is approximately 0.8 V and 
does not take the signal into the receiver detection 
range below 2 V. This negated state is a bit higher than 
usually found, so the bus pulse is starting from a higher 
point. If the pulse had started from a lower point, for 
example about 2.5 V, the pulse amplitude would not 
have been as large. Further discussion of the receiver 
detection range appears later in this section. 

The signals in Figure 25 were purposely chosen 
to have broad falling edges of approximately 15 ns. 
Normal SCSI signals arc 5 ns or faster. The broad edges 
maximize the chance that the bus pulse will produce a 
signal slope reversal of the type that can produce mul- 
tiple edges. The bottom trace in Figure 25 shows a bus 
pulse in the mostsensitivc part of the falling edge. This 
pulse produces almost no slope reversal because bv the 
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time it is ready to become positive-going, the data sig- 
nal has fallen so much that there is no voltage source 
to drive the signal more positive. At the beginning of 
the falling edge, the slew rate is increased bv the bus 
pulse; in the middle, the edge is extended and conse- 
quently the overall time required for the falling edge is 
almost exactly the same as for the filling edge that has 
no bus pulse (see the top trace). 

Therefore, the main effect of rapid transient pulses 
occurs when they intersect the signal edges (where a 
state change is expected anywav), and the effect is 
movement of the position of the edge by no more 
than 2 ns from the normal position. This movement is 
already accounted for in the SCSI standard as /)u/se 
distortion skew, so there is no important effect. 

If the mating event happens while the bus signal is 
in the asserted state, there is little effect since little 
charge is transferred. If the event happens in the rising 
edge, there may not be enough voltage difference to 
start a rapid transient — again, there is little effect. If a 
rapid transient is initiated on a rising edge, the impact 
is still a small shift in the position of the edge. In 
any arbitrary combination of signal level and type of 
transient, the bus disturbance will not be greater than 
those shown in Figure 24 and Figure 25. 

Differential 

For differential SCSI systems, essentially the same 
behavior occurs as for the single-ended case except that 
the relationship between two contacts instead of just 
one must be considered. If insertion transients on the 
positive signal differential line are occurring at the same 
time as transients on the negative signal line, we must 
examine the difference between these transients to 
see what impact they have on the differential signals. 
Based on the time required between mating events on 
neighboring connector pins presented in the section 
Connector Insertion Dynamics and in Figure 15, it is 
evident that the differential case is almost always two 
independent and isolated single-ended cases. This is 
because the difference in the time required for different 
pins in the same connector to begin the mating process 
vastly exceeds the actual transient time on either signal. 

In SCSI differential systems, both the positive and 
the negative signals are normally positive with respect 
to the local grounds. This means that the transients 
will be the same polarity on both signals. 

In the very rare cases in which some overlap exists 
between the transient times on both signals, the rapid 
transient disturbances would usually be seen as com- 
mon mode events that reduce the effective differential 
transient signal. These events arc not seen if common 
mode noise exists where the signals have opposite 
polarity with respect to local grounds during the 
transients. In this case, it is theoretically possible to 
produce anticommon mode differential transients. 



However, the anticommon mode case will always have 
the positive and negative signal lines within a differen- 
tial logical voltage level of ground, and the transients 
will therefore be small. Even in the anticommon mode 
case, the effect is at most a slight shift in the time when 
the differential state change is observed, since the tran- 
sient disturbances are so small. 

In the pathological differential case, large common 
mode levels exist on both the positive and the negative 
signals. The insertion transient will be larger because 
the bus voltage is larger. This case is even more rare since 
it requires both coincidental pin mating and coinci- 
dental large common mode. 

The other case considered that can have a unique 
effect on differential systems is that of extended bounce. 
This case extends the effective mating time to the point 
when some overlap between the transient activity on the 
pins is more likely. Recall that the extended bounce case 
was onlv visible when a leakage mechanism was available 
to discharge the incoming device capacitance. In actual 
devices, no significant leakage occurs so a bounce event 
does not produce disturbances after this first contact. 

The differential signal seen by the incoming device 
may be seriously affected by extended bounce if there 
is bus activity during this bounce. Consider, for exam- 
ple, a case in which the positive signal contact opened 
because of a bounce event after achieving a full charge, 
While it is still open, the negative signal changes state. 
Now both the positive and negative signals arc at the 
same nominal potential, which is an indeterminate dif- 
ferential condition. Fortunately, this condition is not a 
problem because the only device that sees this condi- 
tion is the device being inserted or removed and it is 
not in an operational state. 

Summary of the Handling of Device Insertion and 
Removal Transients 

After a complex, vet self-consistent, set of experimental 
data and interpretations, the concluding results are that 
the worst-case SCSI bus transients resulting from proper 
insertion and removal processes should not cause errors 
in the SCSI bus as presently specified in the SPI and 
Fast-20 (UltraSCSI) standards. The proper processes 
include pregrounding prior to insertion, avoiding 
excessive device capacitance, and using SCSI drivers and 
receivers that meet all of the SCSI requirements. 4 

As of this writing, all reports of device insertion/ 
removal errors have been traced back to failure to use 
proper procedures or designs. The most common errors 
are lack of pregrounding, devices that do not maintain 
the high-impedance input state during power cycling, 
and power distribution or mechanical transient effects 
unrelated to SCSI proper. 

The mechanisms that operate span a time spectrum 
from picoseconds in rapid transients to seconds in con- 
tact wipe and other macro connector operations. 
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The worst-case differential transients occur when 
one treats the differentia] system as two independent 
single-ended SCSI buses — one for the positive signal 
and one for the negative signal. 

The rapid transient becomes more and more 
detectable as bus speeds increase and the receivers and 
timing margins become more sensitive. Schemes to 
encourage the gradual transient are the best protec- 
tion against the ultimate problems caused by rapid 
transients. The best-known method for producing 
reliable gradual transients is to avoid a metal-to-metal 
contact during the initial contact and until the device 
capacitance is charged. At this time, no such connector 
system exists for SCSI applications. 

Overall Summary 

Evolution in four significant hardware technologies 
in the recent past has enabled parallel SCSI to break 
through the barriers that were preventing it from 
delivering excellent value, flexibility, and growth to the 
computer data storage industry. Application of more 
scientific methods, use of the latest silicon technology, 
and developments in the interconnect technology pro- 
vided the foundation for these improvements. DIGITAL 
provided most of die basic data and led important stan- 
dards and industry bodies to accomplish this. 
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Peter L. Higginson 
Michael C. Shand 

Development of Router 
Clusters to Provide Fast 
Failover in IP Networks 



IP networks do not normally provide fast 
failover mechanisms when IP routers fail or 
when links between hosts and routers break. 
In response to a customer request, a DIGITAL 
engineering team developed new protocols 
and mechanisms, as well as improvements to 
the DECNIS implementation, to provide a fast 
failover feature. The project achieved loss-of- 
service times below five seconds in response 
to any single failure while still allowing traffic 
to be shared between routers when there are 
no failures. 



A DIGITAL router engineering team has refined and 
extended routing protocols to guarantee a five-second 
maximum loss-of-scrvice time during a single failure in 
an Internet Protocol (IP) network. We use the term 
router cluster to describe our improved implementa- 
tion. A router cluster is defined as a group of routers 
on the same local area network (LAN), providing 
mutual backup. Router clusters have been in service 
since mid -199 5. 

Background 

The Digital Equipment Corporation Network Integration 
Server (DECNIS) bridge/router is a midrangc to 
high-end product designed and built by a DIGITAL 
Networks Product Business Group in Reading, U.K. 1 
The DECNIS performs high-speed routing of IP, 
DECnet, and OSI (open system interconnection) pro- 
tocols and can have the following network interfaces: 
Ethernet, EDDI (fiber distributed data interface), 
ATM (asynchronous transfer mode), HSSI (High- 
Speed Serial Interface), TI/E1 (digital transmission 
schemes), and lower-speed WAN (wide area network) 
interfaces. The DECNIS bridge/router is designed 
around a Future bus backplane, with a number of 
semi -autonomous line cards, a hardware based address 
lookup engine, and a central control processor respon- 
sible for the control protocols and route calculation. 
Data packets are normally handled completely by the 
line cards and go to the central processor only in 
exception cases. 

The DECNIS routers run a number of high-profile, 
high-availability, wide-area data networks for tele- 
phone service providers, stock exchanges, and chemi- 
cal companies, as well as forming the backbone of 
DIGITAL'S internal network. 

Typically, the DECNIS routers are deployed in 
redundant groups w ith diverse interconnections, to 
provide verv high availability. A common requirement 
is never to take the network down (i.e., during mainte- 
nance periods, connectivity is preserved but redun- 
dancy is reduced), 
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Overview 

IP is the most widely used protocol for communication 
between hosts. Routers (or gateways) arc used to link 
hosts that are not directly connected. When IP was 
originally designed, duplication of WAN links was com- 
mon but duplication of gateways for hosts was rare, and 
no mechanisms for avoiding failed routers or broken 
links between hosts and routers were developed. 

In 1994, we began a project to restrict loss-of- 
serviec times to below five seconds in response to any 
single failure; for example, failure of a router or its 
electrical supply, failure of a link between routers, or 
failure of the connection between the router and the 
LAN on which the host resides. In contrast, existing 
routing protocols have recovery times in the 30- to 
45 -second range, and bridging protocols are no bet- 
ter. Providing fast failover in IP networks required 
enhancements to many areas of the router's design to 
cover all the possible failure cases. It also required the 
invention of new protocols to support the host-router 
interaction under IP. This was achieved without 
requiring any changes to the host I P code. 

In this paper, we start by discussing our targets and 
the behavior of existing routing or bridging protocols 
and follow this with a detailed analysis of the different 
failure cases. We then show how we have modified the 
behavior of the routing control protocols to achieve 
the desired failover times on links between routers or 
in response to the failure of intermediate routers. 
Finally, we describe the new IP Standby Protocol and 
the mechanisms we developed to achieve fast recovery 
from failures on the LANs local to the end hosts. This 
part of the problem is the most challenging because 
the hosts are of many types and have IP implementa- 
tions that cannot realistically be changed. Thus all 
changes have to be made in the routers. 

Our secondary aims were to allow the use of router 
clusters in any existing network configuration, not to 
constrain failover to simple pairs of routers, to be able 
to share traffic betw een available routers, and to con- 
tinue to use the Internet Control Message Protocol 
(I CMP) redirect mechanism for optimum choice of 
router by hosts on a per destination basis. A common 
problem of hosts is that they do not time out redirects. 
This problem is avoided by the adoption mechanism 
within the router cluster. Having met these aims, as 
well as fast failover, we can justifiably call the result 
router clusters. 

The Customer Challenge 

A particular customer, a telecommunications service 
provider, has an Intelligent Services Network applica- 
tion by which voice calls can be transferred to another 
operator at a different location. The data network 



manages the transferral and passes information about 
the call. The application uses User Datagram Protocol 
(UDP) packets in IP with retransmission from the 
application itself. 

Because this application requires a high level of data 
network availability, network designers planned a 
duplicate network with many paired links and some 
mesh connections. Particular problems arise when the 
human initiator becomes impatient if there are delays; 
however, the more critical requirement was one over 
which the network designers had no control. The 
source of the calls is another system that makes a single 
high-level retransmission after five seconds. If that 
retransmission does not receive a response, the whole 
system at the site is assumed to have failed. This leads 
to new calls being routed to other service sites or sup- 
pliers, and manual intervention is required. 

To resolve this issue, the customer requested a 
networking system that would recover from a single 
foil urc in any link, interface, or router within a five- 
second period. The standard test (which both the cus- 
tomer and we use) is to start a once-per-second ping, 
and to expect to drop no more than four consecutive 
ping packets (or their responses) upon any event. The 
five-second maximum break also has to apply to any 
disruption when the failed component recovers. 

To meet the customer challenge, the router group 
in Reading developed the router cluster implementa- 
tion on the DECNTS. In the next two sections, we dis- 
cuss the bridging and routing protocols in use at the 
start of our project and relate our analysis of the cus- 
tomer's network problems. 

Bridging and Routing Default Recovery Times 

In a large network, a routing control protocol is essen- 
tial in order to dynamically determine the topology of 
the network and to detect failing links. Bridging con- 
trol protocols may be used similarly in smaller net- 
works or may be used in combination with routing. 

Bridging and routing control protocols often have 
failure recovery times in the order of a minute or more. 
A typical recovery consists of a detect time during 
which adjacent routers learn about the failure; a distrib- 
ution time during which the knowledge is shared, pos- 
sibly throughout the whole network; and a route 
recalculation time during which a new set of routes is 
calculated and passed to the forwarding engine. 

Detection times are in the order of tens of seconds; 
for example, 30 seconds is a common default. The two 
most popular Jink-state routing control protocols 
in large IP networks are Open Shortest Path First 
(OSPF) 2 and Integrated Intermediate System-to- 
Intcrmediatc System ( Integrated IS-IS)/ These proto- 
cols have distribution "hold downs" (to limit the 
impact of route flaps) to prevent the generation of a 
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new control message within some interval (typically 5 
or 30 seconds) of a previous one. The distribution of 
the new information is rapid ( typical Iv less than one 
second), depending primarily on link speeds and 
network diameter; however, the distribution may be 
adversely affected by transmission errors which require 
retransmission. The default retransmission times after 
packet loss vary between 2 and 10 seconds. The route 
recalculation typically takes less than one second. 
These values result in total recovery times after failures 
(for routing protocols with default settings) in the 
45- to90-second range. 

Distance vector routing protocols, such as the 
Routing Information Protocol (RIP), 4 typically take 
even longer to recover, partly because the route com- 
putation process is inherently distributed and requires 
multiple protocol exchanges to reach convergence, 
and partly because their timer settings tend to be fixed 
at relatively long settings. Consequently, their use is 
not further considered in this paper. 

Similarly, bridging protocols, as standard, use a 15- 
sccond timer; one of the worst-case recovery situations 
requires three timeouts, making 45 seconds in all. 
Another bridging recovery case requires an unsolicited 
data packet from a host and this results in an indeter- 
minate time, although a timeout will cause flooding 
after a period. 

In IP protocols, there is no simple wav for a host to 
detect the failure of its gateway; nor is it simple tor a 
router to detect the failure to communicate with a 
host. In the former case, several minutes may pass 
before an Address Resolution Protocol (ARP) entry 
times out and an alternative gateway is chosen; for 
some implementations, recovery may be impossible 
without manual intervention. Failure to communicate 
with a host may be the result of failure of the host 
itself, which is outside the scope of this project. 
Alternatively, it may be due to failure of the LAN, or 
the router's LAN interface. In this case, there exists an 
alternative route to the LAN through another router, 
but the routing protocols will not make use of it unless 
the subnet(s) on the LAN arc declared unreachable. 
This requires either manual intervention or timely 
detection of the LAN failure by the router. 

Analysis of the Failure Cases 

The first task in meeting the customer's challenge was 
to analyze the various failure and recoverv modes and 
determine which existing management parameters 
could be tuned to improve recovery times. After that, 
new protocols and mechanisms could be designed to 
fill the remaining shortcomings. 

A typical network configuration is shown in Figure 1 . 
The target network is similar but has more sites and 
manv more hosts on each LAN. Manv of the site 
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Typical Configuration for Router Cluster Use 



routers are DECNIS 500 routers with one or two 
WAN links and two Ethernets. The second Ethernet is 
used as a management rail and as a redundant local 
path between routers one and two (R1-R2) and 
between routers three and four (R3-R4). 

In the original plans for the customer network, the 
router cloud consisted of groups of routers at two or 
three central sites and pairs of links to the host sites. In 
designing our solution, however, we tried to allow any 
number of routers on each LAN, interconnected by a 
general mesh network. For test purposes, both we and 
the customer used this set-up with direct R1-R3 and 
K2-R4 Tl links as the network cloud. 

VVc have to consider what happens to packets travel- 
ing in each direction during a failure: there is little gain 
in delivering the data and losing the acknowledg- 
ments. Since the direction of data flow does not give 
rise to additional complications in the network cloud, 
there are just two failure cases: 

1 . Failure of a router in the network cloud 

2. Failure of a link in the network cloud 

We keep these cases distinct because the failure and 
recovery mechanisms are slightly different. 

We also need to consider a failure local to one of the 
LANs on which the hosts are attached. A failure here 
has two consequences: (1) The packets originated by 
the host must be sent to a different router, and ( 2 ) The 
response packets from the other host through the net- 
work cloud must also be sent to a different router, so 
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that it can send them to the host. We break down this 
type of failure into the following three cases: 

3. Packets from the host to a failed or disconnected 
router 

4. Packets to the host when the router fails 

5. Packets to the host when the router interface fails 

Note that we are using the term rmiter interface 
failure to include cases in which the connector falls 
out or some failure occurs in the IAN local to the 
router (such that the router can detect it). In practice, 
failure of an interface is rare. (Removing the plug is 
not particularly common in real networks but is easy 
to test.) Figure 2 shows these failure cases; this con- 
figuration was also used for some of the testing. 

Recovery of a link that previously failed causes no 
problems because the routers will not attempt to use it 
until after it has been detected as being available. Prior 
to that, they have alternate paths available. Recovery 
of a failed router can cause problems because the 
router may receive traffic before it has acquired suffi- 
cient network topology to forward the traffic cor- 
rectly. Recovery of a router is discussed more fully in 
the section on Interface Delay. 

Can Existing Bridging or Routing Protocols Achieve 
5 -Second Fa Hover in a Network Cloud? 

In this section, we discuss the failure of a router and the 
failure of a link in the network cloud (cases 1 and 2). 



The customer requested enhanced routing, and the 
existing network was a large routed WAN, so enhanc- 
ing bridging was never seriously considered. Our expe- 
rience has shown that the 1 5 -second bridge timers can 
be reduced only in small, tightly controlled networks 
and not in large WANs. Consequently, bridging is 
unsuitable for fast fai lover in large networks. 

For link-state routing control protocols such as 
OSPF and Integrated IS-IS, once a failure has been 
detected recovery takes place in two overlapping 
phases: a Hood phase in which information about the 
failure is diso'ibuted to all routers, and a route calcula- 
tion phase in which each router works out the new 
routes. The protocols have been designed so that only 
locaJ failures have to be detected and manageable para- 
meters control the speed of detection. 

Detection of failure is achieved by exchanging Hello 
messages on a regular basis with neighboring routers. 
Since the connections are usually IAN or Point-to- 
Point Protocol (PPP) (i.e., with no link-layer acknowl- 
edgments), a number of messages must be missed 
before the adjacency to the neighbor is lost. The mes- 
sages used to maintain the adjacency are independent 
of other traffic (and in a design like the DECNIS may 
be the only traffic that the control processor sees). 
Typical default values are messages at three-second 
intervals and 10 lost for a failure, but it is possible to 
reduce these. 
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FAILURE CASES 

1 Failure of a router in the network cloud 

2. Failure of a link in the network cloud 

3. Packets from the host to a failed or disconnected router 

4. Packets to the host when the router fails 

5. Packets to the host when the router interlace fails 



Figure 2 

Diagram of Failure Cases Targeted for Recovery 
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Decreasing the Routing Timers 

The default timer values arc chosen to reduce over- 
heads, to cover short outages, and to ensure that it is 
not possible for long packets to cause the adjacency to 
expire accidentally by blocking Hello transmission. 
(Note transmission of a 4,500-bytc packet on a 64 
kilobit-per-sccond link takes half a second, and queu- 
ing would normally require more than a packet time. ) 
However, with high-quality Tl or higher link speeds 
in the target network and priority queuing of Helios in 
the DECNIS, it is acceptable to send the Helios at one- 
second intervals and count three missed as a failure. 
(Although we have successfully tested counts of two, 
we do not recommend that value for customers on 
WAN links because a single link error combined with a 
delay due to a long data packet would cause a spurious 
failure to be detected.) The settings of one second and 
three repeats were within the existing permitted 
ranges for the routing protocols. 

When these shorter timers arc used, it is important 
that any LANs in the network should not be over- 
loaded to the extent that transmissions arc delayed. 
The network managers should monitor WAN links 
and disable any links that have high error rates. Given 
the duplication of routes, it is better to disable and ini- 
tiate repairs to a bad link than to continue a poor ser- 
vice. Many customers, with less contro lled networks 
and less aggressive recovery targets, have adopted the 
router cluster system but kept to more conservative 
timers (such as 1 second and 10 repeats). 

Implementation and Testing Issues 

In some cases, a failed link may be detected at a lower 
level (e.g., modem signals or FDD I station manage- 
ment) well before the routing protocol realizes that it 
has stopped getting Helios and declares the adjacency 
lost. (This can lead to good results during testing, but 
it is essential also to test link-failure modes that are not 
detected by lower levels.) In the worst case, however, 
both the detection of a failed router or the detection of 
a failed link rely on the adjacency loss and so have the 
same timings. 

Loss of an adjacency causes a router to issue a 
revised (set of) link-state messages reflecting its new- 
view of the local topology. These link-state messages 
arc flooded throughout the network and cause every 
router in the network to recalculate its route tables. 
However, because the two or more routers will nor- 
mally time out the adjacency at different times, one 
message arrives first and causes a premature recalcula- 
tion of the tables. Therefore it may require a subse- 
quent recalculation of the route tables before a new 
two-way path can be utilized. We had to tunc the 
router implementation to make sure that subsequent 
recalculations were done in a speedy manner. 

During initial testing of these parameters, we discov- 
ered that failure of certain routers represented a more 



serious case. However discussion of this is deferred to 
the later section The Designated Router Problem. 

Our target five seconds is made up of three seconds 
for the failure to be detected, leaving two seconds for 
the information about the failure to be flooded to all 
routers and for them to recalculate their routes. 
Within the segment of the network where the recov- 
ery is required, this has been achieved (with some tun- 
ing of the software). 

Recovery from Failures on the LANs Local to the 
End Hosts 

The previous section shows that we can deal with router 
failure a nd link failure in the network cloud (cases 1 and 
2). Here we consider cases 3, 4, and 5, those that deal 
with failures on the LANs local to the end hosts. 

From the point of view of other routers, a failed 
router on a LAN (case 4) is identical to a failed router in 
the network cloud (case 1): a router has died, and the 
other routers need to route around it. Failure case 4 
therefore is remedied by the timer adjustments 
described in the previous section. Note that these timer 
adjustments arc an integral part of the LAN solution, 
because they allow the returning traffic to be re-routed. 
These timer adjustments cannot work properly if the 
LAN parts of router clusters are using an inappropriate 
routing control protocol such as RIP 4 , which takes up 
to 90 seconds to recover from failures. 

Detecting LAN Failure at the Router 

A solution to case 5 — packets to the host when the 
router interface fails — for TP requires that the router 
can detect a failure of its interface (for example, that 
the plug has been removed). If the LAN is an FDDI, 
this is trivial and virtually instantaneous because con- 
tinuous signals on the ring indicate that it is working 
and the interface directly signals failure. For Ethernet, 
we faced a number of problems, partly due to our imple- 
mentation and partly due to the nature of Ethernet itself 
We formed a small team to work on this problem alone. 

Because of the variety of Ethernet interfaces that might 
be attached, there is no direct indication of failure: only an 
indirect one by failure to successfully transmit a packet 
within a one-second interval. For maximum speed, the 
DECNIS implementation queues a ring of eight buffers 
on the transmit interface and docs not check for errors 
until a ring slot is about to be reused. This means that an 
error is only detected some time after it has occurred, con- 
suming much of our fi ve -second budget. 

The control software in the DECNIS management 
processor has no direct knowledge of data traffic 
because it passes directly between the line cards. 
Therefore it sends test packets at regular intervals to 
find out if the interface has failed. By sending large test 
packets occupying many buffers, it ensures that the 
ring circulates and errors are detected. Initially, we 
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reduced the rimers and increased the frequency of rest 
packets to be able to detect interface failure within 
three seconds. (The test packets have the sender as 
destination so that no one receives them and, as usual, 
more than one failure to transmit is required before 
the interface is declared unusable. ) 

This initial solution caused several problems when it 
was deployed to a wider customer group; we had more 
complaints than previously about the bandwidth con- 
sumed by the test messages and, more seriously, a 
number of instances of previously working networks 
being reported as unusable. These problem networks 
were either exceptionally busy or had some otherwise 
undetected hardware problem. Over time, the net- 
works with hardware problems were fixed, and we 
modified the timers to avoid false triggering on very 
busy networks. Clearly, the three-second target 
required more thought. 

Several enhancements have since been made. First, 
the timers are user configurable so that the network 
managers can trade off between aggressive recovery 
times, bandwidth used, and false detection. Second, 
the test packet generator takes into account other 
packets sent by the control processor such that they 
are onjv sent to the size and extent required for the 
totaJ traffic to cause the ring to circulate. This is a sig- 
nificant improvement because the aggressive routing 
timers discussed previously cause Hello packets to be 
sent at one-second intervals, which is often sufficient 
not to require extra test packets. Third, the line card 
provides extra feedback to the control program about 
packets received and the transmission of packets not 
originated by the control processor. This feedback 
gives an indication of successful operation even if some 
transmits are failing. 

Re-routing Host Traffic When a Router or Router 
Connection Fails 

Case 3 was by far the most difficult problem to solve. 
IP does not provide a standard mechanism to re-route 
host traffic when a router fails, and the only method 
in common use (snooping RIP messages in the 
hosts) is both "deprecated" by the RPCs and has fixed 
45 -second timers that exceed our recovery target. 
Customers have a wide range of IP implementations 
on their hosts, and reliance on nonstandard features is 
difficult. The particular target application for this work 
ran on personal computer systems with a third-party 
IP stack, and we obtained a copy for testing. Such IP 
stacks sometimes do not have sophisticated recov ery 
schemes and discussion with various experts led us to 
believe that we should not rely on any co-operation 
from the hosts. 

Among other objectives, we wanted to be inde- 
pendent of the routing control protocol in use (if any), 
to permit both a mesh style of networking and more 



than two routers in a cluster, and to continue to route 
traf fic by reasonably optimal routes. In addition, we 
wished to not confuse network management protocols 
about the true identity of the routers involved and, 
if possible, to share traffic over the WAN links where 
appropriate. 

Electing a Primary Router 

In our solution, the first requirement is for other 
routers on the LAN to detect that a router has failed or 
become disconnected, and to have a primary router 
elected to organize recovery. This is achieved bv all 
routers broadcasting packets (called IP Standby 
Helios) to other routers on the LAN every second. 
The highest priority (with the highest IP address 
breaking ties) router becomes the primary router, and 
failure to receive IP Standby Helios from another 
router for ;/ seconds (three is the default) causes it to 
be regarded as disconnected. This condition may 
cause the selection of a new primary router, which 
would then initiate recovery to take traffic on behalf of 
the disconnected router. 

The IP Standby Helios are sent as "all routers multi- 
casts" and therefore do not add additional load to 
hosts. They are UDP datagrams" to a port we regis- 
tered for this purpose (digital-vrc; see the Internet 
Assigned Numbers Authority | IANA] on-line list). 
The routers are manually configured with a list of all 
routers in the cluster. To make configuration easier 
and less error prone, the list on each router includes 
itself, and hence an identical set of configuration para- 
meters can be used for all the routers in a cluster. 
Automatic configuration w as rejected because of the 
problem of knowing which other routers should exist. 

Function of the Primary Router in ARP Mode 

Our first attempt (called ARP Mode) uses a fake IP 
address (one per subnet for a LAN with multiple sub- 
nets), which the current primary router adopts and the 
hosts have configured as their default router. The pri- 
mary router returns its own media access control 
(MAC) address when the host broadcasts an A IIP 
request (using the standard ARP protocol 6 ) for the 
fake IP address and thus takes the traffic from the host. 
Altera failure, a newly elected primary router broad- 
casts an ARP request containing the information that 
the fake IP address is now associated with the new pri- 
mary router's MAC address. This causes the host to 
update its tables and to forward all traffic to the new 
primary router. 

The sending of ICMP redirects 7 by the routers has to 
be disabled in ARP mode. Redirects sent by a router 
would cause hosts to send traffic to an IP address other 
than the fake IP address controlled by the cluster, and 
recovery from failure of that router would then be 
impossible. Disabling redirects causes an additional 
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problem. If the primary router's WAN link foils, all t lie 
packets have to be inefficiently forwarded back over the 
LAN to other routers. To avoid this problem, we intro- 
duced the concept of monitored circuits, whereby the 
priority of a router to become the primary depends on 
the state of the WAN link. Thus, the primary router 
changes when the WAN link fails (or all the links fail if 
there are several), and the hosts send the packets to the 
new primary (whose WAN link is still intact). 

ART mode has a number of disadv antages. It does 
not necessarily use an optimum route when the WAN 
links form a mesh rather than the simple pair case, 
because redirects have to be disabled. The monitored 
circuit concept works only on the first hop from the 
router; more distant failures cannot change the IP 
Standby priority and may result in inefficient routing. 
Most seriously, die rules for hosts acting on information 
in ARP requests have only a "suggested implementa- 
tion" status in the llFCs,and we found several hosts that 
did not change when requested, or were very slow in 
doing so. (Note that we did consider broadcasting an 
ARP response, but there is no allowance in the specifica- 
tions for this message to be a broadcast packet, whereas 
an ARP request is normally a broadcast packet.) 

MAC Mode IP Standby (to Re-route Host Traffic) 

To solve these problems, we looked for a mechanism 
that did not rely on any host participation. The result 
was what we termed MAC mode. Here, each router 
uses its own IP address (or addresses for multiple sub- 
nets) but answers ARP requests with one of a group of 
special A/LAC addresses, configured for each router as 
part of the router cluster configuration. When a router 
fails or becomes disconnected, the primary (or the 
newly elected primary) router adopts the failed router. 
By adopt, we mean it responds to ARP requests for the 
failed router's IP address with the foiled router's spe- 
cial MAC address, and it receives and forwards all 
packets sent to the failed router's special MAC address 
(in addition to traffic sent to the primary router's own 
special MAC address and those of any other failed 
routers it has adopted). 

The immediate advantages of MAC" mode are that 
ICMP redirects can continue to be used, and, provid- 
ing the redirects are to routers in the cluster, the fast 
failover will continue to protect against further fail- 
ures. The mechanism is completely transparent to the 
host. In a cluster with more than two routers, the pri- 
mary router will use redirects to cause traffic (resulting 
from failure) to use other routers in the cluster if they 
have better routes to specific destinations. Thus multi- 
ple routers in a cluster and mesh networks are sup- 
ported. This also solves the problem of hosts not 
timing out redirects (an omission common to many IP 
implementations derived from BSD), because the redi- 
rected address has been adopted. 



In MAC mode, the hosts are configured with the IP 
address of any router in the cluster as the default gate- 
way. (The concept that it does not matter which router 
is chosen is one of the hardest for users to accept.) 
Some load sharing can be achieved by setting different 
addresses in different hosts. 

Since the DECNIS is a bridge router, it has the capa- 
bility to receiv e all packets on Ethernet and many MAC 
addresses on FDDI; thus all packets on all the special 
MAC addresses are seen by all routers in the cluster, 
and its own and. those of any adopted routers are for- 
warded. The special MAC addresses used arc those 
associated with the unused DECnet area 0. They are 
ideal because they are part of the locally administered 
group and have implementation efficiencies in the 
DECNIS because the DECnet hi-ord (AA-00-04-00 ) is 
already decoded, and they are 16 addresses differing in 
one nibble only (i.e., AA-00-04-00-0x-00, where x is 
the hexadecimal index of the router in the cluster). 
Note that ARP requests sent by the router must also 
contain the special MAC address in the source hard- 
ware address field of the ARP packet, otherwise the 
hosts' ARP tables may be updated to contain the wrong 
MAC address. 

MAC mode has minor disadvantages. Initially, it is 
easy to spread the load over a number of routers; how- 
ever, this can be lost alter redirects. In addition, a small 
chance of packet duplication exists during recovery 
because there mav be a short period when both 
routers are receiving on the same special MAC address 
(which does not happen in ARP mode because the 
host changes the MAC] address it is using). This is 
preferable to a period when no router is receiving on 
that address. 

Interface Delay 

Recently, we added an interface delay option to ame- 
liorate a situation more likely to occur in large net- 
works. In this situation, a router, rebooting after a 
power loss, a reboot, or a crash, reacquires its special 
MAC] address before it has received all of the routing 
updates from neighboring routers and thus drops 
packets sent to it (and worse, returns "unreachable" to 
the host). Typically, the main LAN initialization would 
be delayed for 30 seconds while routing table updates 
were received over the WAN interfaces and any other 
IAN interfaces. The backup continues to operate dur- 
ing this 30 seconds. (Note that with Integrated IS-IS, 
we could have delayed IP on the whole router, but we 
did not do this because it would not have worked for 
OSPF, which requires IP to do the updates.) We use a 
fixed configurable time rather than attempting to 
detect the end of updating, because determining com- 
pletion is difficult if the network is in a stare of flux or 
the router's WAN links are down. 
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Redirects and Hosts That Ignore Them 

When a router issues an ICMP redirect, die RFCs state 
that it must include its own IP address in the redirect 
packet. A host is required to ignore a redirect received 
from a router whose IP address is not the host's next 
hop address for the particular destination address. 
Therefore, it is necessary to ensure that the address of 
the failed router is correctly included when issuing a 
redirect on its behalf. In the DECNIS implementation, 
because the destination MAC address of a received 
packet is not available to the control processor, the pri- 
mary router cannot tell whether a redirect has to be 
issued on behalf of itself or one of the adopted routers. 
The primary router therefore issues multiple redi- 
rects — one for each adopted router (in addition to its 
own). Since redirects are rare, this is not a problem, 
but they could be avoided by passing the MAC desti- 
nation address oft he original packet (or just five bits to 
flag a special MAC address and say which it is) to the 
control processor. 

It is contrary to the basic IP rules for hosts to ignore 
redirects. 8 Despite the rules, some hosts do ignore 
redirects and continue sending traffic which has to be 
sent back over the same LAN. These cause problems in 
all networks because of the load, and, in the DECNIS 
implementation, because every time the line card rec- 
ognizes a redirect opportunity, it signals the control 
processor to consider sending a redirect. This may 
happen at data packet rates and is a severe load on the 
control processor, which slows down processing of 
routing updates and might then cause our five-second 
recovery target to be exceeded. 

To reduce the problems caused by hosts ignoring 
redirects, we improved the implementation to rate- 
limit the generation of redirect opportunity messages 
by the line cards. We also recommend in the docu- 
mentation that, where it is known that hosts ignore 
redirects (or their generation is not desired), the 
routers be connected by a lower-cost LAN than the 
main service LAN (such as the management LANs 
shown in Figure I). Normally, this would mean link- 
ing (just) the routers by a second Ethernet and setting 
its routing metric so that it is preferred to the main 
LAN for packets that would otherwise traverse back on 
the main LAN to the other router. This has two advan- 
tages. Such packets do not consume double band- 
width and cause congestion on the main LAN, and 
they pass only through the fast-path parts of the 
router, which are well able to handle full Ethernet 
bandwidth. 

In MAC mode, it is also possible to define a router 
that does not actually exist (but has an IP address and 
a special MAC address) and is adopted by another 
router, depending on the state of monitored WAN cir- 
cuits. Setting this as the default gateway is another way 
of coping with hosts that ignore redirects. 



Special Considerations for Bridges 

We do not recommend putting a bridge or layer 2 
switch between members of a router cluster, because 
during failover, action would be required from the 
bridge in order for the primary router to receive pack- 
ets that previously were not present on its side of the 
bridge. We cannot rely on this being the case, so we 
must have a way of allowing bridges to learn where the 
special MAC addresses currently are. More impor- 
tantly, if bridges do not know where the special MAC 
addresses are, they often use much less efficient (flood- 
ing) mechanisms. 

For greater traceability (and simpler implementa- 
tion), we use the router's real MAC address as the 
source address in data packets that it sources or for- 
wards. We use the special MAC address as the source 
address in the IP Standby Helios. Since the Hello is 
sent out as an IP multicast, it is seen by all bridges or 
switches in the local bridged network and causes them 
to learn the location of the address (whereas data pack- 
ets might not be seen by non-local bridges). Since we 
are sending the Helios every one second anyway, there 
is no extra overhead. 

When a primary router has adopted routers, it cycles 
the source MAC address used for sending its Hello 
between its own special MAC address and those of the 
adopted routers. We also send out an additional Hello 
immediately when we adopt a router to speed up 
recognition of the change. 

Since the same set of special MAC addresses is used 
by all router clusters, we were concerned that a bridge 
that was set up to bridge a non-IP protocol (e.g., local 
area transport [LAT]) but not to bridge IP, might be 
confused to see the same special MAC address on 
more than one port. (This has been observed to hap- 
pen accidentally, and the resultant meltdown has led 
us to avoid any risk, however slight, of this happen- 
ing.) Hence we make 16 special MAC addresses avail- 
able and recommend to users that they allocate them 
uniquely within a bridged domain, or at least use dis- 
joint sets on either side of a bridge. 

The Designated Router Problem 

While testing router failures, we discovered additional 
delays during recovery due to the way in which link- 
state protocols operated on LANs. In these cases, the 
failure of routers not handling the data packets can 
also result in interruption of service due to the control 
mechanisms used. 

For efficiency reasons in link-state routing proto- 
cols, when several routers are connected to a LAN, 
they elect a designated router and the routing proto- 
cols treat the LAN as having a single point-to-point 
connection between each real router and a pseudo 
router maintained by the designated router (rather 
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than connections between all the routers). The desig- 
nated router issues link-state packets on behalf of the 
pseudo router, showing it as having connections to 
each real router on the local LAN, and each router 
issues a link-state packet show ing connection to the 
pseudo router. This mechanism operates in a broadly 
similar way in both Integrated IS-IS and OS Ft'; the 
primary difference being that the OSPF election 
exhibits hysteresis, thus minimizing unnecessary des- 
ignated router changes. 

For routing table calculations, a transit path over the 
LAN is taken from a router to the pseudo router and 
then to another router on the LAN. Hence any change 
in pseudo router status disrupts calculation of the net- 
work map. 

When a designated router fails, a slew of updates 
occurs; each router on the LAN loses the adjacency to 
the old designated router and issues a new link-state 
packet. Next, the new designated router is elected 
(or in the case of OSPF, the backup designated router 
takes over), and each router issues a link-state packet 
showing a link to it. In parallel, the new designated 
router issues a set of (ink-state packets showing its 
connections. This is a new router on the network as 
far as the other routers are concerned; the old desig- 
nated router stays, disconnected, in the tables for as 
long as 20 minutes to an hour, This happens at level 1 
and at level 2 in Integrated IS-IS, resulting in twice 
as many updates. The interactions are complex; in 
general, thev result in the sending of multiple, new 
link-state messages. 

Apart from the pure distribution and processing 
problem of these updates and new link-state packets, 
there are deliberate delays added. A minor one is that 
updates in Integrated IS-LS arc rate-limited on LANs 
(to minimize the possibility of message loss). A major 
one is that a particular Jink-state packet cannot be 
updated within a holding time from a previous update 
(to limit the number of messages actually generated). 
The default holding time is 30 seconds in Integrated 
ISTS; it can be reduced to 1 second in the event we 
found that the best solution was to allow as many as 10 
updates in a 10-second period. The reason for this is 
that the first update usually contains information 
about the disconnection and it is highly desirable to 
get the update with the connection out as fast as possi- 
ble. In addition, in the wider network, an update can 
overtake and replace a previous one. 

With OSPF, the protocol defines a minimum hold- 
ing time off ive seconds, w hich limits the recovery time 
when the designated router fails. The target cus- 
tomer's network was using Integrated IS-IS, and so we 
were able to achieve the five-second recovery even 
when the designated router failed. (Note that with 
two routers, one must be the designated router so it is 



not a rare case.) We ha\e not, so far, felt that it is 
worthwhile to break the rules by allowing a shorter 
holding time for OSPF. 

Conclusions 

We successfully designed and implemented router 
clusters for the OECNIS router w ith shared workload 
and interruptions after failures •! less than five seconds 
in both LAN and WAN environments. This capability 
has been deployed in the product since the middle of 
1995. An Internet Engineering Task Force (IETF) 
group is currently attempting to produce a standard 
protocol to meet this need. 0 
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Shared Desktop: A 
Collaborative Tool for 
Sharing 3-D Applications 
among Different 
Window Systems 



The DIGITAL Workstations Group has designed a 
software application that supports the sharing 
of three-dimensional graphics and audio across 
the network. The Shared Desktop application 
enables the collaborative use of any application 
over local and long-distance networks, as well 
as interoperation among Windows- and UNIX- 
based computers. Its simple user interface 
employs screen capture and data compression 
techniques and a high-level protocol to transmit 
packets using TCP/IP over the Internet. 



An advanced product development effort undertaken 
bv graphics engineers in the DIGITAL Workstations 
Group led to the creation of a new software application 
called Shared Desktop. One project goal was to enable 
collaboration among users of three-dimensional (3-D) 
graphics workstations that run either the UNIX or the 
Windows NT operating system. Another goal was to 
allow these users to access the high-performance 3-D 
capabilities of their office workstations from their 
laptop computers or home-based personal computers 
(PC's) that run the Windows 95 system and do not 
have 3-D graphics hardware. This goal necessitated 
a cross-operating-system application that could effi- 
ciently and effectively handle 3-D graphics in real time 
and share these graphics with machines such as laptop 
computers and PC's. 

In this paper, we begin with a discussion of the software 
currcntlv available for computer collaboration. We then 
discuss the development of the Shared Desktop applica- 
tion, focusing on the user interface, protocol splitting, 
screen capture and data handling, and dissimilar frame 
buffers. We conclude with sections on additional uses and 
future directions of the Shared Desktop product. 

Current Collaboration Software 

Computer collaboration may be defined as the interac- 
tion between computers and their human users over a 
local or long-distance network. In general, it involves a 
transfer of textual, graphical, audible, and visual infor- 
mation from one collaborator to another. The parti- 
cipants share control information either by means 
of computci- generated synchronization events or by 
human voice and visual movement. 1 

Specifically, computer collaboration involves com- 
municating and sharing data among participants who 
can be located anvwherc in a building, a city , a country, 
or the w oiid. Each participant has either a PC, a work- 
station, or a laptop computer. Some machines contain 
3-D graphics adapters with hardware acceleration. 
( Computer-aided design/computer-assisted manu- 
facturing | CAD/CAM] applications like Parametric 
Technology Corporation's [PTC] Pro/ENGINEER 
use hardware accelerators through OpcnGL 2 or 
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Dircct31>* programming protocols.) Other computers 
do not contain 3-D accelerator boards and provide 3-D 
capabilities through software -only routines on two- 
dimensional ( 2 -D) hardware. In a typical collaboration, 
a person w anting to share a specific 3-D graphical dis- 
play of a part or model telephones others to discuss the 
design in progress. After the initial contact, the collab- 
orators may continue the telephone call or switch to 
the audio function of the application. The graphics part 
appears on each participant's screen along with associ- 
ated keyboard and mouse events. As the collaborators 
discuss the work, they may each interact with the dis- 
play to highlight, rotate, and change the look or design 
of the 3-D part. In this way, even though the partici- 
pants are separated by some distance, they may interact 
as if they were all sitting around a table working, con- 
versing, and designing the 3-D part. 

Current software that facilitates computer-based 
collaboration runs through a range of capabilities from 
the earliest forms of electronic mail to the most recent 
offerings of complete collaborative sharing of the 
computer. Examples include VVinFrame technology 
from Citrix Svstcms, Inc., NetMeeting from Microsoft 
Corporation, Netscape Communicator from Netscape 
Communications Corporation, and other products 
from Sun Microsystems, Hewlett-Packard, and Silicon 
Graphics Inc. These packages offer levels of computer 
sharing and collaboration from videoconferencing and 
file sharing to full application sharing. Each implemen- 
tation runs on specific operating systems. Although 
they use various underlying communication protocols, 
most recent designs work over local area and wide area 
networks (LANs/WANs), including the Internet. For 
example, the NetMeeting product provides confer- 
encing tools like chat, whiteboard, file transfer, audio 
and videoconferencing, and non-real-time, selected - 
window 2-D application sharing over T120 protocols 
layered on the Transmission Control Protocol/ 
Internet Protocol (TCP/IP). 4 NetMeeting runs only 
on Microsoft platforms (Windows 95 and Windows 
NT operating systems). The current products are defi- 
cient, however, in that they do not support multiple 
operating svstcms, do not operate in real time, and do 
not share 3-D graphics. 

User Interface 

In this section, we describe our choice of a simple user 
interface for the sharing area of a desktop and our 
design of the Shared Desktop Manager for client- 
server computing. 

Many collaboration tools for sharing computer 
information (graphical desktop, keyboard, mouse, and 
audio of a given computer) were complete systems and 
required too much effort on the part of the users just 
to learn how to share information. A focus on learning 
collaboration tools often requires users to become 



experts in the collaboration software rather than in the 
applications that they may share. Since the various 3-D 
graphics packages that needed to be shared were com- 
plicated in themselves, we decided to implement a 
simple user interface in the Shared Desktop applica- 
tion that ncarlv all audiences could easily learn and use. 

In the Shared Desktop design, \\c designated part 
of the desktop screen as a sharing area. Graphics 
objects such as icons and applications located within 
the sharing area can be accessed by all conference 
participants. To share a new application, a participant 
moves the application into the sharing area. To 
remove an application, a participant moves it outside 
the sharing area. If the sharing area encompasses the 
entire desktop of the initiating participant, all applica- 
tions are shared. We used standard pull-down menus 
and widgets provided by either the UNIX X Motif 
toolkit or the Microsoft Windows libraries. We named 
the sharing area the "viewport"; it is viewed on the 
desktop as a user-defined area of rectangular size and 
location. Any graphical object placed into the view- 
port is marked as shareable with client users in a col- 
laboration. We designed the viewport so that it is 
always on the bottom of a given stack of windows on a 
desktop. Thus, when Shared Desktop is minimized, so 
is its viewport. The objects that had been within the 
uewport are returned to the initiator's desktop and are 
no longer shared. With a quick minimization, the 
server collaborator can pause any sharing that was in 
progress without disconnecting from the client users. 

Figure 1 illustrates a UNIX server with a Shared 
Desktop viewport connected to several client svstcms. 
The server's viewport contains no shared objects 
within its confines, and each client screen shows a 
viewport received from the server. 

The viewport can be set to represent the entire visi- 
ble desktop, or it can be set to equal onlv the size of a 
given application on the screen. Accordingly, a user 
who is acting as the server can determine how much of 
a given desktop to share among the client collabora- 
tors. The concept of a viewport is valuable because the 
principal collaborator (at the server) can quickly glance 
at the screen and determine what to capture and send 
to other participants. (The objects and applications 
sent from the server are designated bv solid lines in 
Figure 1 .) The Shared Desktop application requires no 
further action to set up an application for sharing. 

Each client sends keyboard and mouse events to the 
server to control any application present in the view- 
port (remote control is shown as dashed lines in 
Figure 1 ). Server and clients synchronize cursor move- 
ments so that any conference member can watch 
as others make changes to a shared application. This 
allows the cursor to become a pointer during a session. 
Shared Desktop implements an "anarchy 11 form of 
remote control, with all mice and keyboards active 
simultaneously. 
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Figure 1 

Server Desktop with Viewport and Clients 

When a user initiates a collaboration, the audio is off 
by default but remains integral to a session as a conve- 
nience (as opposed to using the telephone). Through 
a pull-down menu operation, the server enables audio 
for all participants in one operation. The usual audio 
management tools used to set microphone recording 
levels and speaker/headset pJay-back levels are avail- 
able. As Figure 1 indicates, the UNIX machine collects 
audio and distributes it to the three client collabora- 
tors. Likewise, the three clients collect audio and send 
it back to the server for mixing. In this way, all partici- 
pants can hear one another and interact with whatever 
objects appear in the viewport on the server's screen. 

Figure 2 shows the Shared Desktop Manager from the 
initiator's viewport running on die UNIX server. A par- 
ticipant may use a Session pull-down menu to control the 
viewport and to connect and disconnect other confer- 
ence members. The Options menu allows for audio, 
remote cursor, and call-back control. The application's 
Help pull- down menu provides the usual help informa- 
tion similar to a Windows help facility or a Web browser's 
help. The window lists die status of attached clients. 
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Figure 2 

Shared Desktop Manager 



Upon connection, participants can hear and interact 
with the server. The resultant audio dialogue combined 
with the graphics, keyboard, and mouse interactions 
facilitate a collaboration environment in which partici- 
pants share an application. Since each user can operate 
a separate mouse and keyboard, the audio channel acts 
as a synchronization mechanism to indicate which col- 
laborator controls the shared applications at any given 
moment. The participants communicate their actions 
verbally, interacting in much the same way as people 
who are sitting around a table and working. 

Design Features 

For our implementation, we concentrated on three 
principal areas: protocol splitting, screen capture and 
data handling, and dissimilar frame buffers. In this sec- 
tion, we discuss our investigation into using a protocol 
splitter and our decision to rely on screen capture and 
data handling. We also discuss dissimilar frame buffers. 

Protocol Splitting 

We looked for a way to distribute 3-D graphics among 
workstations and PCs that would be independent of 
the application, graphics protocol, architecture, operat- 
ing system, and windowing system. On UNIX, we 
found application sharing provided by distributed win- 
dows protocols. For example, the X Protocol 5 allows a 
user to send an application to a nonlocal display and to 
send X applications protocol messages to several 
screens simultaneously. A protocol splitter, however, 
has disadvantages due to its requirements for band- 
width, programming, and latency. 
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Protocol splitters require distribution of graphics 
commands and display lists by means of a network. 
Three-dimensional models often contain megabytes of 
graphical information that describe specific screen 
operations. When displaying a model locally, these 
graphics operations move quickly and easily over sys- 
tem buses that are capable of handling hundreds of 
megabytes per second. However, when these same 
graphics objects are copied over computer networks, 
the amount of information can overload even the 
highest-speed networks. For example, using a 100- 
megabvte (MB) Pro/ENGINEER truck assembly, a 
current generation 3-D workstation can load, display, 
and rotate the truck once in approximately 2 minutes. 
The same operation between two identical 3-D work- 
stations takes 20 minutes when performed by a distrib- 
uted protocol, and the rotation of the truck does not 
appear fluid to the user. If the same data or application 
is duplicated on every machine, only updates with syn- 
chronizing events are distributed, but this requires that 
all machines have the same graphics hardware. 

The programming software needed for interopera- 
tion among dissimilar operating and windowing 
systems using protocol splitting is quite involved. 
The ability to support XI 1 desktops, Windows 95 
desktops, and Windows NT desktops while using mul- 
tiple 3-D protocols like OpenGL and Direct3D would 
require that these protocols exist on all platforms. 

Latency requirements for 3-D are very stringent. 
Thus, any network jitter makes even the best network 
link create breakup (visual distortions) when rotating 
3-D objects. Network jitter also causes delays in send- 
ing window protocol messages; as a delay increases, 
the window events may no longer be useful. For exam- 
ple, when rotating a 3-D object, the delayed events 
must propagate as the network permits although this 
may once again congest the network since the events 
may no longer be needed. The object has now rotated 
to a new view. The ability to drop some protocol mes- 
sages in a time-critical way is a requirement for collab- 
orating with 3-D objects, and the protocol splitter 
approach to sharing has no solution for this problem. 

Screen Capture and Data Handling 

To overcome these issues, we investigated capturing the 
screen display, the final bitmap result of the interaction 
of graphics hardware and software that the viewer sees. 
Capturing the screen is in itself nothing new; it has been 
used for some time to include screen visuals in docu- 
ment preparation. Initially, we were skeptical that cap- 
turing the screen display could be a useful mechanism 
since the amount of data on a screen can be prodigious. 
Screen graphics depth and resolution can make the 
amount of data in any given graphics object very large. 
For example, for a 24-plane frame buffer with a ] ,280 
by 1,024 resolution, the total amount of data to capture 
would be (24 x 1 ,280 x l,024)/8 or about 4 MB. Using 



the computational power of the Alpha microprocessor 
for reducing the data, we continued our investigation. 
We found that this approach requires the windowing 
system to perform screen capture by means of a non- 
CPU-intensive routine (direct memory access [DMA] 
as opposed to programmed I/O). Based on our tests, 
we concluded that screen capture technology would be 
easier to implement than a protocol splitter, would 
have better latency for 3-D operations than a protocol 
splitter, and would be easily adaptable to the various 
windowing systems and 3-D protocols we wished to 
have interoperate. 

Graphics Compression The screen capture approach 
requires a number of steps to efficiently prepare the 
data for transmission. First, the contents of a viewport 
are captured, and the sample is saved for comparison 
with successive samples. Second, the captured viewport 
samples are differenced to find screen pixels that have 
not been changed and delta values for those that have 
been changed. Third, the resultant array of values is 
compressed by a fast, run-length encoding (RLE) of 
the array of difference samples. A more CPU -intensive 
compression may now be applied. The fourth step is to 
apply LZ77 compression that reduces the remaining 
RLE data to its smallest form. In step four, the original 
data has been reduced while retaining its characteristics 
so that it can be restored (uncompressed) without loss 
on a receiving computer. This final lossless stage of 
compression occurs only if it reduces the amount of 
data and if the network was busy during a previous 
transmission. Lossless compression is important for the 
nondestructive transfer of data from the server's screen 
to the clients' screens and has application in industry. 
As an example, consider a doctor who is sharing an 
x-ray with an out-of-town colleague. If the graphics 
were compromised by a lossy compressor, the collabo- 
rators could not be guaranteed that the transmitted 
x-ray was identical to the one sent. With the Shared 
Desktop application, the doctor who is sending the 
x-ray is guaranteed that the original graphics are 
restored on the colleague's display. In some forms of 
compression, data is thrown out by the algorithm and 
never restored, so that the final screen data may not 
accuratelv reflect the original graphics. Figure 3 shows 
the steps in the capture and compression sequence. 

On the Alpha architecture, these compression steps 
are performed as 64-bit operations, both in the data 
manipulation and the compression algorithms. The 
Alpha architecture lends itself to a fast and efficient 
implementation of the algorithms, so that the capture 
of the viewport and the multistage compression of the 
data can be accomplished in real time. Approximately 
half of the number of instructions is used on a proces- 
sor that is twice as fast as a 32 -bit architecture. In addi- 
tion to its 64-bit routine, the RLE is implemented as a 
32-bit routine and as a comparison routine. 
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Figure 3 

Graphics and Audio Compression Data Flow Diagram 



Audio Compression Similar to the graphics compres- 
sion described, the audio compression in Shared 
Desktop involves several steps. First, the audio samples 
are captured through a microphone and sound card 
combination. These samples arc compared with the 
background noise level (determined prior ro beginning 
a conference) to see if the samples are useful. Samples 
below the background noise level are not transferred. 
This implements a silence detection method whereby 
only useful samples will advance to the next level 
of compression. Second, the next compression uses 
G.711 or other similar audio compression standards 
and converts adaptive differential pulse code modula 
rion (ADPOV1) samples at 64 kilobits per second into 
16 kilobits per second (4:1 lossy compression Third, 
this data is then ready for transfer to a receiving com 
puter so that it may be decompressed and output ro a 
speaker or a headset. The audio stream resulting from 
these steps generates at most 16 kilobits per second 
when someone is speaking, and no output when it is 
silent. Figure 3 also shows the audio compression steps. 

Data Transmission After the graphics and audio data 
are collected and compressed, they are combined and 
transmitted across the network bv a patented, higher- 
level protocol that ensures timely deliver)' of each 
packet." All packets are sent using TCP/IP over the 
Internet. Although the higher-level protocol does not 
guarantee true real-time characteristics, the patented 
protocol allows for coherent audio, synchronization 
of graphics and cursor events, and near real-time 
graphics animation. 

As an example, the screen capture shown in Figure 4 
displays a 100-MB Pro/KNGINHKll assembly being 
shared through the Shared Desktop application. The 
Shared Desktop Manager system (system where the 



assembly database resides) is an AlphaStation 500 
workstation running the DIGITAL UNIX operating 
system with a PowerStorm 4D60 graphics controller. 
In this example, an 800- bv 600-pixel by 24-bit Shared 
Desktop v ievv port is being captured, compressed, and 
transmitted to the Shared Desktop client system at 
about five updates per second. The update rate is 
determined by the capture view port size, the extent of 
detail changes between captures, the amount of pro 
cessing power needed bv the application to make 
changes to the model, and the speed of the network. 
In this example, when rorating the truck assembly, a 
compressed stream of 400 to 500 kilobytes per second 
is generated and represents the fiv e updates per second 
mentioned. A simple assembly might be able to do a 
rotation with Shared Desktop capturing and transmit- 
ting 15 updates per second, and a more complicated 
mode! 'like the truck assembly shown) would receive 
fewer updates per second. 

Dissimilar Frame Buffers 

To complete the requirements of our implementation, 
we needed to share graphics information across dis- 
similar hardware, i.e., machines with different graphic 
frame buffer depths. The frame buffer depth refers to 
the amount of storage the graphics adapter gives to 
each displayed pixel on the screen. A 16-bit-deep dis- 
play assigns each pixel a 16-bit value to represent the 
pixel. This representation is usually the color informa- 
tion for the pixel, i.e., what color the user sees for a 
given pixel.* The frame buffer depths are a necessary 
reality since different graphics devices have widely 
varving screen depths, ranging from 4 planes (4 bits 
per pixel) to 32 planes (32 bits per pixel). Typically, 
higher end graphics devices have higher-depth graph- 
ics outputs, especially for 3-D graphics, and the lower- 
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Figure 4 

Screen Capture of the Windows Shared Desktop Client Sharing Pro/ENGINEER with a UN IX Shared Desktop Server 



depth displays are usually found on less-capable, 2-D 
graphics platforms. Most laptop computers have low 
bit depth (8 to 16) displays and no 3-0 capabilities. 
Commodity PCs also typically have 8- or 16-plane 
depths. Graphics devices that support 3-0 graphics 
provide deeper display types such as 24-bit or 32-bit. 
Some devices support a mix of several or all the bit 
depths listed in the matrix (below) either concurrently 
or for the entire screen at one time. 

We defined a matrix of screen depths and proceeded 
to fill in the various combinations so that the applica- 
tion would work effectively across different platforms 
and graphics hardware capabilities. The matrix enables 
computers without 3-D capability to display the out- 
put from 3-D-capable graphics devices. The matrix of 
screen-depth combinations follows. 
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The matrix shows input screen or visual type depth 
across the top row and delineates output bitmap depth 
on the left column. Bitmap depths of 4, 8, 15, 16, 24, 
and 32 are used in Windows systems, and depths of 4, 
8, 12, 24, and 32 are used in XI 1. The .v in the matrix 
requires no conversion and is captured and displayed 
without the need for additional conversion. The c 
shows bitmap depths that can be expanded to the out- 
put format by using a colormap or by shifting pixels 
into the correct format. The d shows that information 
must be dithered to match the output. Dithering can 
result in a minimal loss of information, but we have 
developed a very good and efficient method of doing 
this conversion. The m (mix mode) marks those visual 
types on XI 1 that can exist on the screen when the 
root depth is 24 or 32; i.e., an 8-bit window can be 
present on a 24-bit display. The mix mode requires a 
different interpretation of the 24- bit pixels prior to 
compression and transmission. Since no 1 2-bit output 
displays exist, ;? marks inapplicable transformations. 
Alternate formats of 24 pixels (3 bytes per pixel and 
blue/green/red [BGR] triples) are supported as well 
as 8-bit pseudocolor and 8-bit true color. 

Sample Uses 

Like other collaboration softw are, the Shared Desktop 
application can be used in remote situations to help 
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people communicate and share data. These uses include 
telecommuting, debugging/support, and education. 

Telecommuting 

One feature we built into Shared Desktop is the ability 
to originate a sharing session from a remote location. 
Our intent was to allow an individual to work outside 
the office environment on a home PC or a laptop com- 
puter. In the telecommuting scenario, a workstation 
with high-end graphics functions and applications 
located in the office would call back the home user's 
low-end system and present the user with his work 
environment. For example, consider a user of PTC/s 
Pro/ENGINEER who is working on a 3-D assembly 
with a 100- MB database and must make a change to 
the part from home. Prior to the Shared Desktop 
application, the only options were either to mimic the 
work environment at home or drive to the office to 
make the change. To mimic a work environment, the 
equipment at home must support Pro/ENGINHER 
software and might require 3-D hardware. In addi- 
tion, the user would have to retrieve a recent version of 
the 100-MB database over the telephone lines, which 
would take many hours to copy. With the Shared 
Desktop application, the user can access the 100-MB 
part using the low-end computer over standard tele- 
phone lines. The changes to the assembly then occur 
on the system and to the large database :it the office. 

Remote Debugging/Support 

Another use of the Shared Desktop application is for 
customer support or remote debugging. Consider the 
user of a 3-D design application w ho discov ers a bug 
in a new version of the software. A complex model 
often causes a bug that requires software support to 
obtain the database to re-create the problem. Using 
Shared Desktop, a user could show a support repre- 
sentative the problem on the running application, as 
opposed to filing a problem report. 

Off -site Training 

A remote training scenario provides a final example of 
collaboration using computers. The Shared Desktop 
application facilitates remote training by connecting 
students in a sharing session. Each student's desktop 
displays a lesson composed of the course material 
installed on the instructor's desktop. Students inreract 
with the teacher by audio, mouse, and keyboard 
actions on objects in the screen viewport. In essence, 
the teacher uses the synchronized cursors to highlight 
or point to objects on the screen. 

Conclusion and Future Directions 

The Shared Desktop collaboration software employs 
a simple user interface that emphasizes ease of 3-D 
application sharing and audio conferencing. Compared 



to application sharing based on a protocol splitter, the 
Shared Desktop application offers easier interoper- 
ability and better latency during 3-D operations. With 
a protocol splitter approach, it is difficult to decide 
which, if any, graphics events to drop when network 
jitter or network bandwidth delays occur. Our 
approach is synchronized to the last screen capture. 
When the network is no longer congested, the current 
screen capture can be sent, thus minimizing the per- 
ceived effect of the network delay. The only disadvan- 
tage to bitmap sharing is its requirement that the 
windowing system and display driver implement a 
DMA screen capture function and not programmed 
I/O. DMA screen capture requests have a minimal 
load on the windowing system. 

We are planning a number of improvements to the 
advanced development version of Shared Desktop. 
In our initial work, we made no changes to the win- 
dowing systems. Ideally, the product version might 
have a mechanism that notifies an application when 
and where another application has made changes to 
the screen. With the added ability to capture only 
those areas of the screen that have changed since the 
last notification, the windowing svstem could perform 
the first two steps in the capture process. 

Although the compression scheme we implemented 
works for most cases, some graphics may not compress 
well using the combination of RLE and LZ77. Instead, 
content-specific compression or adaptive compression 
techniques might be better applied. This is an area of 
study we hope to pursue. 

The current graphical user interface (GUI) lacks 
some conferencing features. The product version will 
be packaged with other applications to provide video, 
chat, whiteboard, file transfer, and user locator/ 
directory services. 

Finally, the sharing model we implemented for the 
Shared Desktop application is easily ported to other 
systems. Thus the application could be available for 
widespread use. 
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David C. P. LaFrance- Linden 



Challenges in Designing 
an HPF Debugger 



High Performance Fortran (HPF) provides 
directive-based data-parallel extensions to 
Fortran 90. To achieve parallelism, DIGITAL'S HPF 
compiler transforms a user's program to run as 
several intercommunicating processes. The ulti- 
mate goal of an HPF debugger is to present the 
user with a single source-level view of the pro- 
gram at the control flow and data levels. Since 
pieces of the program are running in several dif- 
ferent processes, the task is to reconstruct the 
single control and data views. This paper pre- 
sents several of the challenges involved and 
how an experimental debugging technology, 
code-named Aardvark, successfully addresses 
many of them. 



As we learn better ways to express our thoughts in the 
form of computer programs and to take better advan- 
tage of hardware resources, wc incorporate these ideas 
and paradigms into the programming languages we 
use. Fortran 90'- provides mechanisms to operate 
directly on arrays, e.g., A = 2*A to double each element 
of A independent of rank, rather than requiring the 
programmer to operate on individual elements within 
nested DO loops. Manv of these mechanisms are natu- 
rally data parallel. High Performance Fortran (HPF) M 
extends Fortran 90 with data distribution directives to 
facilitate computations done in parallel. Debuggers, in 
turn, need to be enhanced to keep pace w ith new fea- 
tures of the languages. The mndamcntal user require- 
ment, however, remains the same: Present the control 
How of the program and its data in terms of the original 
source, independent of what the compiler has done or 
what is happening in the run-time support. Since HPF 
compilers automatically distribute data and computa- 
tion, thereby widening the gap between actual execu- 
tion and original source, meeting this requirement is 
both more important and more difficult. 

This paper describes several of the challenges HPF 
creates for a debugger and how an experimental debug- 
e;inti; technology, mternallv code-named Aardvark, sue- 
ccssfully addresses manv of them using techniques that 
ha\ e applicability beyond HPF. For example, program- 
ming paradigms common to explicit message-passing 
systems such as the Message Passing Interface (MPIf 1 
can benefit from Aardvark's methods. 

The HPF compiler and run time used is DIGITAL'S 
HPF compiler/ which produces an executable that 
uses the run-time support of DIGITAL'S Parallel 
Software Fjwironment /' DIGITAL'S HPF compiler 
transforms a program to run as several intercommuni- 
cating processes. The fundamental requirement, then, 
is to give the appearance of a single control How and a 
single data space, even though there are several indi- 
vidual control Hows and the data has been distributed. 
In the paper, I introduce the concept of logical entities 
and show how thev address many of the control flow 
challenges. A discussion of a rich and flexible data 
model that easily handles distributed data follows. I 
then point out difficulties imposed on user interfaces, 
especially when the program is not in a completely 
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consistent state, and indicate how they can be over- 
come. Sections on related work and the applicability of 
logical entities to other areas conclude the paper. 

Logical Entities 

From the programmer's perspective, an HPF program 
is a single process/thread with a single control flow 
represented by a single call stack consisting of single 
stack frames. A debugger should strive to present the 
program in terms of these single entities. A key 
enabling concept in the Aardvark debugger is the defi- 
nition of logical entities in addition to traditional phys- 
ical entities. Generally, a logical entity collects several 
phvsical entities into a single entity. Many parts of 
Aardvark arc unaware of whether or not an entity is 
logical or physical, and a debugger's user interface uses 
logical entities to present program state. 

A physical entity is something that exists some- 
where outside the debugger. A physical process exists 
within the operating system and has memory that can 
be read and written. A physical thread has registers 
and (through registers and process memory) a call 
stack. A physical stack frame has a program counter, a 
caller stack frame, and a callcc stack frame. Each of 
these has a representation within the debugger, but 
the actual entity exists outside the debugger. 

A logical entity is an abstraction that exists within die 
debugger. Logical entities generally group together 
several related physical entities and synthesize a single 
behavior from them. In C++ terms, a process is an 
abstract base class; physical and logical processes are 
derived classes. A logical process contains as data mem- 
bers a set of other (probably physical) processes. The 
methods of a logical process, e.g., to set a breakpoint, 
bring about the desired operations using logical algo- 
rithms rather than physical algorithms. The logical 
algorithms often work by invoking the same operation 
on the physical entities and constructing a logical entity 
from the physical pieces. This implies that some opera- 
tions on physical entities can be done in isolation from 
their logical containers. Aardvark makes a stronger 
statement: Physical entities are the building blocks for 
logical entities and arc first-class objects in their own 
right. This allows physical entities to be used for tradi- 
tional debugging without any additional structure." 1 

A positive consequence of this object-oriented design 
is that a user interface can often be unaware of the physi- 
cal or logical nature of the entities it is managing. For 
example, it can set a breakpoint in a process or navigate a 
thread's stack by calling virtual methods declared on the 
base classes. 

Some interesting design questions arise: What is a 
process? What is a thread? What is a stack frame? What 
operations arc expected to work on al l kinds of processes 
but actually only work on physical processes? Experience 
to date is inconclusive. Aardvark currentlv defines the 



base classes and methods for logical entities to include 
many things that are probably specific to physical enti- 
ties. This design was done largely for convenience. 

Sometimes a logical entity is little more than a com 
tainer of physical entities. A logical stack frame for 
threads that are in unrelated functions simply collects 
the unrelated physical stack frames. Nevertheless, logi- 
cal stack frames provide a consistent mechanism for 
collecting physical stack frames, and variants of logical 
stack frames can discriminate how coordinated the 
physical threads are. The concept of logical entities 
does not apply to all cases, though. Variables have val- 
ues, and there does not seem to be anything logical or 
physical about values. Yet, if a replicated variable's val- 
ues on different processors arc different, there is no 
single value and some mechanism is needed. Rather 
than define logical values, Aardvark provides a differ- 
ing values mechanism, which is discussed in a later sec- 
tion of the same name. 

Controlling an HPF Process 

Users want to be able to start and stop H PF programs, 
set breakpoints, and single step. From a user interface 
and the higher levels of Aardvark, these tasks are sim- 
ple to accomplish — ask the process or thread, which 
happens to be logical, to perform the operation. 
Within the logical process or thread, however, the 
complexity varies, depending on the operation. 

Starting and Stopping 

Starting and stopping a logical thread is straightfor- 
ward: Start or stop each component physical thread. 
Some race conditions require care in coding, though. 
For example, starting a logical thread theoretically starts 
all the corresponding physical tlareads simultaneously. 
In practice, Aardvark serializes the physical threads. In 
Figure 1 , when all the physical threads stop, the logical 
thread is declared to be stopped. Aardvark then starts 
the logical thread at time u +" and proceeds to start each 
phvsical thread. Suppose the first physical thread (thread 
0) stops immediately, at time "*." It might appear that 
the logical thread is now stopped because each physical 
thread is stopped. This scenario does not take into 
account that the other physical threads have not yet 
been started. Timestamping execution state transitions, 
i.e., ordering the events as observed by Aardvark, works 
well; a logical thread becomes stopped only when all its 
physical threads have stopped after the time that the 
logical thread was started. An added complexity is that 
some reasons for stopping a physical thread should stop 
the other physical threads and the logical thread. In this 
case, pending starts should be cancelled. 

Breakpoints 

Setting a breakpoint in a logical process sets a break- 
point in each physical process and collects the physical 
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Determining When a Logical Thread Stops 



breakpoint representations into a logical breakpoint. 
For HPF, any action or conditional expression is 
associated with the logical breakpoint, not with the 
physical breakpoints. Consider the expression 
ARRAY(3,4).l_T.5. Even if the element is stored in 
only one process, the entire HPF process needs to stop 
before the expression is evaluated; otherwise, there is 
the potential for incorrect data to be read or for 
processes to continue running when they shouJd not. 
This requires each physical process to reach its physical 
breakpoint before the expression can be evaluated. 
Once evaluated, the process remains stopped or con- 
tinues, depending on the result. For HPF, a break- 
point in a logical process implies a global barrier of the 
physical processes. 

Recognizing and processing a thread reaching a 
logical breakpoint is somewhat involved. Aardvark\s 
general mechanism for breakpoint determination is to 
ask the thread's operating svstcm model if the initial 
stop reason could be a breakpoint. If this is the case, 
the operating system model provides a comparison key 
for further processing. 

For physical DIGITAL UNIX threads, a SIGTRAP 
signal could be a breakpoint, with the comparison key 
being the program counter address of the potential 
breakpoint instruction. This comparison key is then 
used to search the breakpoints installed in the physical 
process to determine which (if any) breakpoint was 
reached. If a breakpoint was reached, the stop reason is 
updated to be "stopped at breakpoint." All this physi- 
cal processing happens before the logical algorithms 
have a chance to notice that the physical thread has 
stopped. Therefore, by the time Aardvark determines 
that a logical thread has stopped, any physical threads 
that arc stopped at a breakpoint have had their stop 
reasons updated. 

For a logical thread, the initial (logical) stop reason 
could be a breakpoint if each of the physical threads is 
stopped at a breakpoint, as shown in Figure 2. The 
comparison key in this case is the logical stop reason 
itself. The breakpoints of the component stop reasons 
are then compared to the component breakpoints of 
the installed logical breakpoints to determine if a logi- 
cal breakpoint was reached. If there is a match, the 
logical thread's stop reason is updated. 



Aardvark achieves the flexibility of vastly different 
types of comparison keys (machine addresses and logi- 
cal stop reasons) bv having the comparison key type be 
the most basic Aardvark base class, which is the equiv- 
alent of Java's Object class, and by using run-time 
typing as necessary. 

Single Stepping 

Single stepping a logical thread is accomplished by 
single stepping the physical threads. It is not sufficient 
to single step the first thread, wait for it to stop, and 
then proceed w ith the other threads. If the program 
statement requires communication, then the entire 
HPF program needs to be running to bring about the 
communication. This implies that single stepping is a 
tw o-part process — initiate and wait — and that the ini- 
tiation mechanism must be part of the exposed inter- 
face of threads. 

As background, running a thread in Aardvark 
involves continuing; the thread w ith a run reason. The 
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run reason is empowered to take the actions (e.g., set- 
ting or enabling temporary breakpoints) necessary to 
carry out its task. In this paper, the word empowered 
means that the reason has a method that will be called 
to do reason-specific actions to accomplish the rea- 
son's semantics. This relieves user interfaces and other 
clients from figuring out how to accomplish tasks. As a 
result, Aardvark defines a "get single-stepping run rea- 
son" method for threads. Clients use the resulting run 
reason to continue the thread, thereby initiating the 
single-step operation. 

Therefore, single stepping a logical thread in 
Aardvark involves calling the (logical) thread's "get 
single-stepping run reason" method, continuing the 
thread with the result, and waiting for the thread to 
stop. The "get single-stepping run reason" method for 
a logical thread in turn calls the "get single-stepping 
run reason" method of the component (physical) 
threads and collects the (physical) results into a logical 
single-stepping run reason. When invoked, the logical 
reason continues each physical thread with its corre- 
sponding physical reason. 

Single stepping dramatically demonstrates the 
autonomy of the physical entities. When continuing 
a (logical) thread with a (logical) single-stepping run 
reason, the physical threads can start, stop, and be 
continued asynchronously to each other and without 
any intervention from a user interface, the logical enti- 
ties, or other clients. This is especially true if the thread 
was stopped at a breakpoint. In this case, continuing 
a physical thread involves replacing the original 
instruction, machine single stepping, putting back the 
breakpoint instruction, and then continuing with the 
original run reason. Empowering run reasons (and 
stop reasons) to effect the necessary state transitions 
enables physical entities to be autonomous, thus 
relieving the logical algorithms from enormous poten- 
tial complexity. 

Coordinating Physical Entities 

The previous discussion describes some logical algo- 
rithms. The section "Starting and Stopping" describes 
using timestamps to determine when a logical thread 
becomes stopped (see Figure 1), and the section 
"Breakpoints" describes a logical thread possibly 
reaching a breakpoint (see Figure 2). The physical 
entities need to be coordinated so that the logical 
algorithms can be run. In Aardvark, this is done with a 
process change handler. A process change handler is a 
set of callbacks that a client registers with a process and 
its threads, allowing the client to be notified of state 
changes. For example, if a user interface is notified that 
a thread has stopped and that the reason is a UNIX 
signal, the user interface can look up the signal in a 
table to determine if it should continue the thread 
(possibly discarding the actual signal) or if it should 
keep the thread stopped. 



In the context of HPF, a user interface registers its 
process change handler with the logical HPF process. 
During construction of the logical process, Aardvark 
registers a physical-to-logical process change handler 
with the physical processes. It is this physical -to-logical 
handler that coordinates the physical entities. When the 
first physical thread stops, as at rime "*" in Figure 1 , the 
handler is notified but notices that the timestamps do 
not indicate that the logical thread should be considered 
to have stopped. When the last physical thread stops, 
the handler then synthesizes a "stopped at collection" 
logical stop reason, as in Figure 2, and informs the 
( logical ) thread that it has stopped. 

Aardvark defines some callbacks in process change 
handlers that are for HPF and other logical paradigms. 
These callbacks allow a user interface to implement 
policies when a thread or process goes into an interme- 
diate state. For example, at time "*" in Figure 1 a 
physical thread has stopped but the logical thread is 
not yet stopped. Whenever a physical thread stops, the 
handler's "component thread stopped" callback is 
invoked. A possible user interface policy is" 

■ If the component thread stopped for a nasty rea- 
son, such as an arithmetic error, try to stop all the 
other component threads immediately in order to 
minimize divergence among the physical entities. 

■ If this is the first component thread that stopped for 
a nice reason, such as reaching a breakpoint, start a 
tinier to wait for the other component threads to 
stop. If the timer goes off before all the other com- 
ponent threads have stopped, try to stop them 
because it looks like they are not going to stop on 
their own. 

■ If this is die last component thread, cancel any rimers. 

The user interface can provide the means for the user 
to define the timer interval, as well as other attributes 
of policies. These policies and their control mecha- 
nisms are not the responsibility of the debug engine. 

Examining an HPF Call Stack 

When an HPF program stops, the user wants to see a 
call stack that appears to be a single thread of control. 
Sometimes this is not possible, but even in those cases, a 
debugger can offer a fair amount of assistance. The HPF 
language provides some mechanisms that also need to 
be consid ere d . The EXTRINSIC(HPF_LOCAL) proce- 
dure type allows procedures written in Fortran 90 to 
operate on the local portion of distributed data. This 
type is usefiil for computational kernels that cannot be 
expressed in a data-parallel fashion and do not require 
communication. The EXTRINSIC(HPF_SERIAI_) 
procedure type allows data to be mapped to a single 
process that runs the procedure. This type is useful for 
calling inherently serial code, including user interfaces, 
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which may nor be written in Fortran. DIGITAL'S HPF 
compiler also supports iici/iuing. w hich allows serial 
code to call parallel HPF code. AJ1 these mechanisms 
affect the call stack or how a user navigates the call 
stack. They require underlying support from the 
debugger as well as user interface support. 

Logical Stack Frames 

Aardvark's logical entity model applies to stack frames: 
logical stack frames collect sev eral physical stack frames 
and present a svnthesized view of the (logical) call 
stack. Currently, Aardvark defines four types of logical 
stack frames to represent different scenarios that can 
be encountered: 

1. Scalar, in which only one physical thread is semanti- 
cal ly active 

2. Synchronized, in which all the threads are at the 
same place in the same function 

3. Unsynchronized , in which all the threads are in the 
same function but at different places 

4. Multi, in which no discernible relationship exists 
between the corresponding physical threads 

Aardvark's task is to discover the proper alignment 
of the physical frames of the physical threads, deter- 
mine which variant of logical frame to use in each case, 
and link them together into a call stack. Ideally, all log- 
ical frames are synchronized, which means that the 
program is in a well-defined state. This is true most of 
the time with HPF; the Single Program Multiple Data 
(SPMD) nature of HPF causes all threads to make the 
same procedure calls from the same place, and break- 
points are barriers causing the threads to stop at the 
same place. 

Aardvark's alignment process starts at the outer- 
most stack frames of the physical threads (the ones 
near the Fortran PROGRAM unit) and then progres- 
sively examines the callees (toward where the program 
stopped). Starting from the innermost frames is an 
error-prone approach. If the innermost frames are in 
different functions, Aardvark might construct a mulri- 
rrame when the frames are actually misaligned because 
the physical stacks have different depths. As discussed 
in the section on twinning, depth is not a reliable 
alignment mechanism either. Starting at the outer- 
most frames follows the temporal order of calls and 
also correctly handles recursive procedures. The dis- 
advantage of starting at the outermost frames is that 
each physical thread's entire stack must he determined 
before the logical stack can be constructed. Usually 
the programmer only wants the innermost few frames, 
so time delays in the construction process can reduce 
the ease of use of the debugger. 1 ' 

Much of the time, the physical stack frames are at 
the same place because the SPMD nature of HPF 
causes the physical threads to have the same control 



flow. When a procedure is called, each thread executes 
the call and executes it from the same place. A logical 
breakpoint is reached when the physical threads are 
stopped at the same place at the corresponding physi- 
cal breakpoints. These cases lead to synchronized 
frames. The most common cause of an unsvnehronized 
frame is interrupting the program during a computa- 
tion. Fa en in this case, the divergence is usually not very 
large. One reason for a multi frame is the interruption of 
the program w hile it is communicating data between 
processes. In this case, the code paths can easily diverge, 
depending on which threads are sending, which are 
receiving, and ho\\ r much data needs to be moved. 
Scalar frames are created because of the semantic flow of 
the program: the main program unit is written in either 
a serial language or an HPF procedure called an 
EXTRINSICCHP F_S E R I A L ) procedure tvpe. 

The result of the alignment algorithm is a set of 
frames collected into a call stack. The normal naviga- 
tion operations (e.g., up and down) apply. Variable 
lookup and expression evaluation work as expected, 
also. Variable lookup works best for synchronized 
frames and, for HPF, works for unsynchronized frames 
as well. For multi frames, variable lookup generally fails 
because a variable name V A R may resolve to different 
program variables in the corresponding physical 
frames or may not resolve to anything at all in some 
frames. This failure is not because of a lack of informa- 
tion from the compiler but rather because multiframes 
arc generally not a context in which a string VA R has a 
we 11 - d e f i n ed s em a n ti c . 

F'xpcricnce to date suggests that multi frames are of 
interest largely to the people developing the run-time 
support for data motion. Nevertheless, the point of 
transition from synchronized to unsynchronized to 
multi tells the user where control flows diverged, and 
this information can be very valuable. 

Narrowing Focus 

Using the previously mentioned techniques sometimes 
results in a cluttered view r of the state of the entire pro- 
gram and difficulty in finding relevant information, 
Aardvark provides two ways to help. The first aid is a 
Boolean focus mask that selects a subset of the processes 
and then re-applies the logical algorithms. For properly 
chosen subsets, this can turn a stack trace with many mul- 
tiframes into a stack trace with synchronized frames. 
A narrowed focus can also look behind the scenes of the 
twinning mechanism described in the next paragraph. 
The second aid is to view a single physical process in 
isolation, effectively turning off the parallel debugging 
algorithms. This technique is useful for debugging 
EXTRINSIC (HP F_L0 CAD and EXTR I NS I C ( HPF_SERI AD 
procedures. The ability to retrieve the physical processes 
from a logical process is the major item that enables slew- 
ing a process in isolation; as mentioned before, physical 
entities are first-class objects. 
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Twinning 

DIGITAL'S HPF provides a feature called twinning 
in which a scalar procedure can call a parallel HPF 
procedure. This allows, for example, the main pro- 
gram consisting of a user interface and associated 
graphics to be written in C and have Forr.ran/HPF 
do the numerical computations. The feature is called 
twinning because each Fortran procedure is com- 
piled twice. The scalar twin is called from scalar code 
on a designated process. Its duties include instruct- 
ing the other processes to call the scalar twin, distrib- 
uting its scalar arguments according to the HPF 
directives, calling the HPF Hcih from all processes, 
distributing the parallel data back onto the desig- 
nated process after the HPF rwin returns, and finally 
returning to its caller. The HPF twin is called on all 
processes with distributed data and executes the 
user-supplied body of the procedure. 

At the run -time level, die program's entry point is 
nor mall v called on a designated process (process 0), 
and the other processes enter a dispatch loop waiting 
for instructions. Conceptually, such a program starts 
in scalar mode and at some point transitions into paral- 
lel mode. An HPF debugger should represent this 
transition. Aardvark accomplishes this by having 
know ledge of the HPF twinning mechanism. When it 
notices physical threads entering the dispatch loop, 
Aardvark creates a scalar logical frame corresponding 
to the physical frame on process 0. It then processes 
procedure calls on process 0 only, creating more scalar 
frames, until it notices that the program transitions 
from scalar to parallel. This transition happens when 
all processes call the same (scalar twin) procedure: 
process 0 does so as a result of normal procedure calls; 
processes other than 0 do so from their dispatch loops. 
At this point, a logical frame is constructed that will 
likelv be synchronized, and the frame processing 
described previously applies. The result is the one 
desired: a scalar program transitions to a parallel one. 

DIGITAL'S HPF goes a step further: it allows 
EXTRINSIC (HP F_S E R I A L ) procedures to call HPF 
code by means of the twinning mechanism. When an 
EXTRINSIC(HPF_SERIAI_) procedure is called, 
processes other than 0 call the dispatch loop. When 
the scalar code on process 0 calls the scalar twin, the 
other processes are in the necessary dispatch loop. 
Aardvark tracks these calls in the same way as in the 
previous paragraph, noticing that processes other than 
0 have called the dispatch loop and eventually call a 
scalar rwin. 

User Interface Implications 

User interfaces and other clients must be keenly aware 
of the concept of logical frames and the different types 
of logical frames. Depending on the type of frame, 
some operations, such as obtaining the function name 



or the line number, may not be valid. Nevertheless, a 
user interface can provide useful information about 
the state of the program. The program used for the 
following discussion has a serial user interface written 
in C and uses twinning to call a parallel HPF procedure 
named HPF_F I ll_in_data (see Figure 3). The 
HPF procedure uses a function named M A N D E L_V A L 
as a non-data-parallel computational kernel. The pro- 
gram was run on ri ve processes. (Twinning is a DIGITAL 
extension. Most HPF programs are written entirely in 
HPF. This example, which uses twinning, was chosen 
to demonstrate the broader problem.) 

Figure 4 shows the program interrupted during 
computation. Line 2 of the figure contains a single 
function name, M A N D E L_V A L. Line 3 contains the 
function's source file name but lists five line numbers, 
implying that this is an unsynchronized frame. In fact, 
the user interface discovered that Aardvark created an 
unsynchronized logical frame. Instead of trying to get 
a single line number, the user interface retrieved the 
set of line numbers and presented them. In lines 4 
through 10, the user interface also presented the range 
of source lines encompassing the lines of all the com- 
ponent processes. This user interface's up command 
(line 2 1 ) navigates to the calling frame. In this exam- 
ple, the frame is svnehronized, causing the user inter- 
face to present the function's source file and single line 
number (line 26), followed by the single source file 
line (Jine 27). 

Figure 5 shows a summary of die program's call stack 
when it was interrupted during computation. The sum- 
mary is a mix of unsynchronized, synchronized, and 
scalar frames. Frame #0 (Jine 2) is unsynchronized, and 
the various line numbers are presented. Its caller, frame 
#1 (line 3), is synchronized with a single line number. 
All this is consistent with the previous discussion. Frame 
#1 is the H PF twin of the scalar twin in frame #2. The 
scalar rwin of frame #2 is expected to be called by scalar 
code, confirmed by frames #3 and #4. Frame #5 is part 
of the twinning mechanism; process 0 is at line 499, 
while the other processors are all at line 506. 

Narrowing the focus to exclude process 0 shows a 
different call stack summary (lines 9 through 16 of 
Figure 5). The new frame #0 (line 11) continues to be 
unsynchronized, but all the other frames are synchro- 
nized. The twinning dispatch loop (Jine 14) replaces 
the scalar frames of the global focus (lines 5 and 6). 
This replacement causes the new call stack, corre- 
sponding more closely to the physical threads, to have 
fewer frames than the global call stack. 

Interrupting the program while idle within the user 
interface shows more about twinning and also shows a 
multiframe (see Figure 6). Most of the frames are 
scalar except for the twinning mechanism (frame #7, 
line 9) and the initial run -time frame (frame #8, line 
10). Narrowing the focus to exclude process 0 shows 
the twinning mechanism while waiting. The twinning 
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subroutine h p f _f i L L_i n_d a t a ( t a r g e t , w, h, ccr, cci, cstep, nmin, nmax) 

integer, intent(in) :: w, h 

byte, intent(out) :: target(w,h) 

rea 1*8, intent(in) :: ccr, cci, cstep 

integer, intent(in) :: nmin, nmax 

!hpf$ distribute targetC*, cyclic) 

integer :: cx, cy 
c x = w / 2 
cy = h/2 

foraLLCix = 1:w, iy = 1:h) & 
t a r g e t ( i x , i y ) = m a n d e l_v a I ( C M P L X ( c c r + ( ( i x - c x ) * c s t e p ) , & 

cci + ( ( i y- c x ) * c s t e p ) , & 
KIND=KIND(0.0D0)), & 
nmin, nmax) 

contains 

pure byte function m a n d e l_v a L ( x , nmin, nmax) 
comp I ex ( KIND = K I ND ( 0 . 0D0) ) , intent(in) :: x 
integer, intent(in) :: nmin, nmax 

integer : : n 

r ea L ( k i n d = K I N D ( 0 . 0 D 0 ) ) :: xorgr, xorgi, xr, xi, xr2, xi2, rad2 
logical ::keepgoing 

n = -1 

xorgr = REAL(x) 
xorgi = AIMAG(x) 
xr = xorgr 
x i = xorgi 

do 

n = n + 1 
xr2 = xr*xr 

x i 2 = x i * x i 

xi = 2*(xr*xi) + xorgi 
keepgoing = n < nmax 
rad2 = xr2 + xi2 

xr = xr2 - xi2 + xorgr 

if (keepgoing .AND. (rad2 <= 4.0)) cycle 
exit 
end do 

if (n >= nmax) then 

mandel_val - nmax-nmin 
else 

mandel_val = M0D(n, nmax-nmin) 
end if 

end function mandel_val 
end subroutine h p f _f i I l_i n_d a t a 



Figure 3 

H P F_F I LL_I N_D A T A Procedure (Source ("ode for Figures 4 and 5 ) 



mechanism at frames #5 and #6 (lines 23 and 24) is 
similar to the mechanism at frames #3 and #4 (lines 14 
and 15) of Figure 5. In Figure 6, thev do not call a 
scalar twin but rather call the messaging library to 
receive instructions from process 0. The messaging 
library, however, is often not synchronized among the 
peers, and frame #2 ( line 1 5 ) shows a multiframe. This 
user interface shows a multiframe as a collection of 
one-line summaries of the physical frames (lines 16 
through 21). 



Examining HPF Data 

Examining data generally involves determining where 
the data is stored, fetching the data, and then present- 
ing it. HPF presents difficulties in all three areas. 
Determining where data is stored requires rich and 
flexible data- location representations and associated 
operations. Fetching small amounts of data can be 
done naively, one element at a time, but for large 
amounts of data, e.g., data used for visualization, faster 
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1 


Thread is interrupted. 


2 


#0: MANDEL_VAL ( X - <<differing COMPLEXC KIND=8) values>>, NMIN = 255, NMAX = 510) 


3 


at mb.hpf.f90 


:45,44,45,40,39 


4 


39 


xr2 = xr*xr 


5 


40 


x i 2 - x i *x i 


6 


41 


xi = 2*(xr*xi) + xorgi 


7 


42 


keepgoing = n < nmax 


8 


43 


rad2 = xr2 + xi2 


9 


44 


xr = xr2 - xi2 + xorgr 


10 


45 


if (keepgoing .AND. (rad2 <= 4.0)) cycLe 


11 






1 2 


debugger> print x 




1 3 


$1 = #<DIFFERING-VALUES 


1 4 


#0: ( 


-0.66200000000000003,-0. 1 1 4) 


1 5 


#1 : ( 


- 0 . 59599999999999997,-0 .113 ) 


1 6 


#2: ( 


-0.65300000000000002,-0.112) 


1 7 


#3: ( 


-0.93799999999999994,-0.10600000000000001 ) 


1 8 


#4: ( 


-0. 56600000000000006,-0.1 1 ) 


1 9 


> 




20 






21 


debugger> up 




22 


#1: hpf Shpf_f i I L_i n_da t a_( TARGET = «non-atomic= I NTEGERC KIND=1 ) , DIMENSION ( 1 : 400, 1: 


23 




W = 400, H = 400, 


24 




CCR = -0.76000000000000001, CCI = -0.02, CSTEP = 0.001, 


25 




NMIN = 255, NMAX = 510) 


26 


at mb.hpf.f90 


:14 


27 


14 


foraLLCix = 1 :w, iy = 1 :h) & 


28 






29 


debugger> info address target 


30 


#<Locati ve_to_ 


hpf_section 5 peers of type INTEGER(KIND=1 ), DIMENSI0NC1 : 400,1 : 400) > 


31 


type 


INTEGER ( KIND=1 ) , D IMENSI0N ( 1 : 400, 1 :400) 


32 


phy s_count 


5 


33 


addresses 




34 


0: 


0x1 1 f f f 71 f 0 


35 


1 : 


0x1 Iff f7000 


36 


2: 


0x1 Iff f7000 


37 


3: 


0x1 Iff f7000 


38 


4: 


0x1 Iff f7000 


39 


a rank 


2 


40 


t rank 


2 


41 


d i m i nf os 


dLower dupper pLower pupper ... dist_k 


42 


0 


1 400 1 400 ... coLLap 


43 


1 


1 400 1 80 . . . eye L i c 


44 






45 


debugger> info address target C 1 00,1 00) 


46 


#<l oca t i ve_i n_ 


peer in peer 4 ... > 


47 


type 


INTEGER ( KIND=1 ) 


48 


peernum 


4 


49 


Locative 


#<Loca t i ve_to_memory at dmem address 0x11fff8e13 of type INTEGERCKIND 



Figure 4 

Program Interrupted during Computation 



methods arc needed. Displaying data can usually use 
the techniques inherited from the underlying Fortran 
90 support, but some mechanism and corresponding 
user interface handling is needed when replicated data 
has different values. 

Data-Location Representations 

Representing where data is stored is relatively easy 
to do in languages such as C and Fortran 77: the data 
is in a register or in a contiguous block of memory. 
Fortran 90 introduced assumed -shape and deferred- 
shape arrays, 1:1 where successive array elements are not 
necessarily adjacent in memory. HPF allows the array 



to be distributed so that sticcessive array elements are 
not necessarily stored in a single process or address 
space. These lead to data that can be stored discon- 
tiguous!}' in memory as well as in different memories. 

Fortran 90 also introduced array sections, vector- 
valued subscripts, and field -of-arrav operations, 14 
which further complicate the notion of w here data is 
stored. Although evaluating an expression involving 
an array can be accomplished by reading the entire 
array and performing the operations in the debugger, 
this approach is inefficient, especially for a result that is 
sparse compared to the entire array. A standard tech- 
nique is to perform address arithmetic and fetch only 



Digital Technical Journal 



Vol. 9 No. 3 1997 



1 


debugger> where 






2 


> #0(unsync) 


MANDE L_VAL at mb . hp f . f 90 : 4 5 , 44 , 4 5 , 40 , 3 9 




3 


#1 (synchr) 


hpf Shpf_f i L L_i n_data_ at mb. hpf . f 90 : 1 4 




4 


#2 ( sy nch r ) 


hpf_f i I L_i n_data_ at mb . hpf . f 90 : 1 




5 


#3 ( sc a L a r ) 


mb_f i L l_i n_data at mb.hpf.c:45 




6 


#4 ( sc a L a r ) 


main at mb . c : 42 1 




7 


#5 ( unsyn c ) 


_hpf_t w i nn i ng_ma i n_usurper at [...]/ 1 i bhpf/hpf_twi n . 


c:499,506,506,506,506 


8 


#6 ( sy nch r ) 


start at C . . . ] / a L pha / c r 1 0 . s : 36 1 




9 


debugger> focus 


1-4 




1 0 


debugger> where 






1 1 


> #0(unsync) 


MANDE L_V A L at mb . h p f . f 90 : <none> , 4 4 , 4 5 , 40 , 39 






#1 (synchr) 


u _ .f q; l. ^ .f .f ■ i i ■ ^ j.i.. _ «- _ u Ur-.-f -fOn-1/. 

n p t j> n p t t i 1 1 i n a a t a at itid . npi . Tyu . i 4 




13 


#2(synchr) 


hpf_f i L L_i n_data_ at mb . hpf . f 90 : 1 




U 


#3(synchr) 


_hpf_non_pee r_0_to_d i spa t ch_Loop at C. . .3/Libhpf/hpf 


_twi n . c : 575 


15 


#4(synchr) 


hpf t w i n n i ng ma i n usurper at C . . . ] / I i bhpf / hpf_t w i n . 


c:506 


16 


#5(synchr) 


start at C . . . ] / a Lph a / c r 1 0 . s : 361 





Figure 5 

Control Flow of' a Twinned Program Interrupted during Computation 



1 debugger> where 

2 > #0(scalar) poLL at <<unknown name>>:41 

3 #1(scaLar) <<di sembod i ed>> at «un known>> : 4 59 

4 #2(scalar) _XRead at <<unknown name>>:1110 

5 #3(scaLar) _X R eadE ven t s at <<unknown name>>:950 

6 #4(scaLar) XNextEvent at <<unknown name>>:37 

7 #5(scaLar) Hand I eX I nput at mb.c:58 

8 #6(scaLar) main at mb.c:452 

9 #7(unsync) _hpf_t w i nn i ng_ma i n_usurper at C . . . J / L i bh p f / h pf _t w i n . c : 499 , 506, 506 , 506, 506 

10 #8(synchr) start at C . . . ] / a Ipha / c r tO . s : 361 

11 debugger> focus 1-4 

12 debugger> where 

13 > #0(unsync) select at <<unknown name>> : <none> , 4 1 , <none> , 4 1 , 4 1 

14 #1(unsync) TCP_MsgRead at [...]/ I i bhpf /msg t cp . c : <none> , 1 05 7, <none> , 1 057, 1 057 

15 #2(multi) 

16 <none> 

17 _TCP_RecvAvai L at [...]/ L i bhpf /msg t cp . c : 1 400 

18 swtch_pri at <<unknown name>>:118 

19 _TCP_Rec vAva i I at [...]/ L i bhpf /msgtcp . c : 1 400 

20 _TCP_RecvAvai L at [...]/ L i bhpf /msgtcp . c : 1 400 

21 #3(unsync) _hpf_Recv at C . . . J / L i bhpf /msgmsg . c : <none>,434,488,434,434 

22 #4(synchr) _hpf_RecvDir at [...]/ L i bhpf /msgmsg . c : 509 

23 #5(synchr) _hpf _non_pee r_0_t o_d i spa t c h_Loop at [...]/ L i bhpf / hpf_twi n . c : 563 

24 #6(synchr) _hpf _twi nni ng_ma i n_usurper at [...]/ L i bhpf /hpf_twi n . c : 506 

25 #7(synchr) start at C . . . ] / a Lp ha / c r 1 0 . s : 361 



Figure 6 

Control Flow of a Twinned Program Interrupted While Idle in Scalar Mode 



the actual data result at the end of the operation. The 
usual notion of an address, however, is that it describes 
the start of a contiguous block of memory. 

Richer data- location representations are necessarv. 
These representations can include registers and con- 
tiguous memory, but they also need to include discon- 
tiguous memory and data distributed among multiple 
processes. The representations should also include the 
results of expressions involving array sections, vector- 
valued subscripts, and field -of- array operations, 
thereby extending address arithmetic to data-location 
arithmetic. Aardvark defines a locative base class that 
has a virtual method to fetch the data. A variety of 



derived classes implement the data-location represen- 
tations needed. 

DIGITAL'S Fortran 90 implements assumed -shape 
and deferred -shape arrays using descriptors that con- 
tain run-time information about the memory address 
of the first element, the bounds, and pcr-dimension 
intcr-clemcnt spacing.'" Aardvark models these types 
of arrays almost directly with a derivation of the loca- 
tive class that holds the same information as the 
descriptor. Performing expression operations is rela- 
tively easy. An array section expression adjusts the 
bounds and the inter-element spacing. A field-of array 
operation offsets the address to point to the compo- 
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ncnt field and changes the element type to that of the 
field. A vector-valued subscript expression requires 
additional support; the representation for each dimen- 
sion can be a vector of memory offsets instead of 
bounds and inter-element spacing. 

All arrays in HPF arc qualified, explicitly or implic- 
itly, with ALIGN, TEMPLATE, and DISTRIBUTE direc- 
tives. 16 DIGITAL'S HPF uses a superset of the Fortran 
90 descriptors to encode this information. Aardvark 
models HPF arrays with another derivation of the 
locative class that holds information similar to the HPF 
descriptors. The most pronounced difference is that 
Aardvark uses a single locative to encode the descrip- 
tors from the set of processes. Aardvark knows that the 
local memory addresses arc potentially different on 
each process and maintains them as a vector, but cur- 
rently assumes that processor-independent informa- 
tion is the same on all processes and only encodes that 
information once. 

Referring again to Figure 4, line 22 shows that the 
argument TARGET is an array, and line 29 is a request 
for information about the location of its data. (Sec also 
Figure 3 for the full source, including the declaration 
and distribution of TARGET.) Figure 4, line 32 shows 
that there are five processes, and lines 34 through 38 
show the base address within each process. The 
addresses for processes 1 through 4 happen to be the 
same, but the address for process 0 is different. Lines 
39 and 40 show that the ran k of the array (arank) and 
the rank of the template ( t rank) are both 2. Lines 42 
and 43 show the dimension information for the array. 
The declared bounds are 1 : 40 0,1 :40 0, but the local 
physical bounds arc 1 : 4 00, 1 : 80 and the distribu- 
tion is ( * , C Y C L I C ) . This is all accu rate; distributing 
the second dimension on five processes causes the 
local physical size tor that dimension (80) to be one- 
fifth the declared bound (400). 

Performing expression operations on HPF-based 
locatives is more involved than for Fortran 90. 
Processing a scalar subscript not only offsets the base 
memory address but also restricts the set of processors 
determined by the dimension's distribution information. 
Processing a subscript triplet, e.g., from:to:stride, 
involves adjusting the declared bounds and the align- 
ment; it does not adjust the template or the physical lay- 
out. As in Fortran 90, processing a vector-valued 
subscript in HPF requires the locative to represent the 
effect of the vector. For HPF, the representation is pairs 
of memory offsets and processor set restrictions. 
Processing a fie I d-of-array operation adjusts the element 
type and offsets each memory address. 

When selecting a single array element by providing 
scalar subscripts, another type of locative is useful. This 
locative describes on which process the data is stored 
and a locative relative to that selected process. For 
example, line 45 of Figure 4 requests the location 
information of a single array element. The result 



shows that it is on process 4 at the memory address 
indicated by the contained locative. 

Fetching HPF Data 

As just mentioned, locatives provide a method to fetch 
the data described by the locative. For a locative that 
describes a single distributed array clement (e.g., 
Figure 4, lines 45 through 49), the method extracts 
the appropriate physical thread from the logical thread 
and uses the contained locative to fetch the data rela- 
tive to the extracted physical thread. For a locative that 
describes an HPF array, Aardvark currently iterates 
over the valid subscript space, determines the physical 
process number and memory offset for each element, 
and fetches the clement from the selected physical 
process. For small numbers of elements, on the order 
of a few dozen, this technique has acceptable per- 
formance. For large numbers of elements, e.g., for 
visualization or reduction operations, the cumulative 
processing and communication delav to retrieve each 
individual element is unacceptable. This performance 
issue also exists for locatives that describe discontigu- 
ous Fortran 90 arrays. The threshold is higher because 
there is no computation to determine the process for 
an element, and the process is usually local rather than 
remote, eliminating communication delays. 

The primary bottleneck is issuing many small data 
retrieval requests to each (remote) process. This 
involves many communication delays and many delays 
related to retrieving each element. What is needed is to 
issue a smaller number of larger requests. The smaller 
number reduces the number of communication trans- 
actions and associated delays. Larger requests allow 
analysis of a request to make more efficient use of 
the operating system's mechanisms to access process 
memory. For example, a sufficiently dense request can 
read the encompassing memory in a single call to the 
operating system and then extract the desired ele- 
ments once the data is within the debugger. 

Although not implemented, the best solution, in 
my opinion, is to provide a u read (multidimensional) 
memory section" method on a process in addition to 
the common "read (contiguous) memory" method. If 
the process is remote, as it usually is with HPF, the 
method would be forwarded to a remote debug server 
controlling the remote process. The implementation 
of the method that interacts with the operating system 
would know the trade-offs to determine how to ana- 
lyze die request for maximum efficiency. 

Converting a locative describing a Fortran 90 array 
section to a "read memory section" method should be 
easy: they represent nearly the same thing. For a loca- 
tive that describes a distributed HPF array, Aardvark 
would need to build (physical) memory section 
descriptions for each physical process. This can be 
done by iterating over the phvsical processes and 
building the memory section for each process. It is 
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also possible to build the memory sections for all the 
processes during a single pass through the locative, but 
the performance gains may not be large enough to 
warrant the added complexity. 

Differing Values 

Using HPF to distribute an array often partitions its 
elements among the processes. Scalars, howev er, are 
generally replicated and may be expected to have the 
same value in each process. There are cases, though, 
where sccminglv replicated scalars mav n#t have the 
same value. DO loops that do not require data to be 
communicated between processes do not have syn- 
chronization points and can become out of phase, 
resulting in their indexes and other priv atized variables 
having different values. Functions called within a 
FORALL construct often run independently of each 
other, causing the arguments and local variables in 
one process to be different from those in another. 
A debugger should be aware that values might differ 
and adjust the presentation of such values accordingly. 

Aardvark's approach is to define a new kind of value 
object called differing ra/ucsxo represent a value from 
a scmanticallv single source that does not have the 
same value from all its actual sources. A user interface 
can detect this kind of value and display it in different 
ways, for example, based on context and/or the size of 
the data. 

Referring again to Figure 4, the program was inter- 
rupted while each process was executing the function 
M A N D E L_V A L called within a FORALL. fine 2 shows 
that the argument X was determined to have differing 
values. This user interface does not show all the values 
at this point; a large number of values could distract 
the user from the current objective of discovering 
where the process stopped. Instead, it shows an indica- 
tion that the values arc different along vv ith the type of 
the variable. Notice that the other two arguments, 
NMIN and NMAX, arc presented as integers; they have 
the same value in all processes. Line 12 requests to see 
the value of X. Line 13 again shows that the values air 
different, and lines 14 through 18 show the process 
number and the value from the process. 

To build a differing values object, Aardvark reads 
the values for a replicated scalar from each process. If 
all the values are bit-wise equal, they are considered to 
be the same and a standard (single) value object is 
returned. Otherwise, a differing values object is con- 
structed from the several values. For numeric data, this 
approach seems reasonable. If the value of a scalar inte- 
ger variable INTVAR is 4 on all the processes, then 4 is 
a reasonable (single) value for INTVAR. If the value of 
INTVAR is 4 on some processors and 5 on others, no 
single value is reasonable. For nonnumeric data and 
pointers, there is the possibilitv of false positives and 
false negatives. The ideal for user-defined types is to 
compare the fields recursively. Pointers that are scman- 



ticallv the same can point to targets located at different 
memory addresses for unrelated reasons, leading to 
different memory address values and therefore a false 
positive. To correctly dereference the pointers, though, 
Aardvark needs the different memory address values. 
In short, it is reasonable to test numeric data and cre- 
ate a single value object or a differing values object, 
and it appears reasonable to do the same for nonnu- 
meric data, despite the possibility of a technically false 
kind of value object. 

Currently, differing values do not participate in arith- 
metic That is, the expression I N TV AR . LT . 5 is valid if 
INTVAR is a single value but causes an error to be sig- 
naled if INTVAR is a differing value. Many cases could 
be made to work, but some cases defy resolution. In the 
INTVAR. LT. 5 case, if all values of INTVAR are Jess than 
5 or all are greater than or equal to 5, then it is reason- 
able to collapse the result into a single value, .TRUE. #r 
. FALSE., respectively. If some values are less than 5 
and some are not, it also seems reasonable to create a 
differing values object that holds the differing results. 
What if I NT V A R . LT . 5 is used as the condition of a 
breakpoint and some values of INTVAR are less than 5 
and some are not? The breakpoint should probably 
cause the process (and all the physical processes) to 
remain stopped. It is unclear whether arithmetic on 
differing values would be useful to users or if it would 
lead to more confusion than it would clear up. 

Unmet Challenges 

HPF presents a variety of challenges that Aardvark 
does not yet address. Some of these challenges are not 
in common practice, giving them low priority. Some 
are recent with HPF Version 2.0 and are being used 
with increasing frequency. Some of the challenges, for 
example, a debugger-initiated call of an HPF proce- 
dure, are tedious to address correctly. 

Mapped Scalars 

It is possible to distribute a scalar so that the scalar is not 
Mill \- replicated. 1 he compiler would need to emit suffi- 
cient debugging information, which would probably be 
a virtual array descriptor with an array rank of 0 and a 
nonzero template rank. Aardvark would probably model 
it using its existing locative for HPF arrays, also with an 
arrav rank of 0 and appropriate template information. 

Replica ted Arrays 

Unless otherwise specified, DIGITAL'S HPF compiler 
replicates arrays. It is possible to replicate arrays explic- 
itly and to align arrays (and scalars so that they arc 
partially replicated. Currently, Aardvark does not 
detect a replicated array, despite the symbol table or 
run-time descriptor indicating that it is replicated. Asa 
result, Aardvark determines a single process from 
w hich to fetch each arrav element. For fully replicated 
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arrays, Aardvark should read the array from each 
process and process them with the differing values 
algorithms. Correctly processing arrays that are par- 
tially replicated is not as easy as processing unrepli- 
cated or fully replicated arrays. If the odd columns are 
on processes 0 and 1, while the even columns are on 
processes 2 and 3, no single process contains the entire 
array. The differing values object would need to be 
extended to index the values by a processor set rather 
than a single process. 

Update of Distributed and Replicated Objects 

Aardvark currently supports limited modification of 
data. It supports updating a scalar object (scalar vari- 
able or single array element) with a scalar value, even if 
the object is distributed or replicated. Even this can be 
incorrect at times. Assigning a scalar value to a repli- 
cated object sets each copy, which is undesirable if the 
object has differing values. Assigning a value that is a 
differing values object is not supported. More impor- 
tantly (and more subtly), Aardvark is not aware of 
shadow or halo copies of data that are stored in multi- 
ple processes, so updating a distributed object updates 
only the primary location. 

Distributed Array Pointers 

HPF Version 2.0 allows array pointers in user-defined 
types to be distributed and allows fully replicated 
arrays of such types. For example, in 

type utype 

integer, pointer :: compptrC:) 
!hpf$ distribute compptr(block) 

end type 

type (utype) :: scalar, array (20) 

the component Held compptr is a distributed array 
pointer, Aardvark does not currently process the array 
descriptor(s) for s c a I a r % compp t r at the right place 
and as a result does not recognize the expression as 
an array. As mentioned earlier, Aardvark reads a repli- 
cated array element from a single process. To process 
a r r a y ( 1 ) % c ompp t r, all the descriptors are needed, 
e.g., for the base memory addresses in the physical 
processes. The use of this relatively new construct is 
growing rapid Iv, elevating the importance of being 
supported by debuggers. 

Ensuring a Consistent View 

A program can have its physical threads stop at the 
same place but be in different iterations of a loop. 
Aardvark mistakenly presents this state as syn- 
chronized and presents data as if it were consistent. 
This is what is happening in Figures 4 and 5; 
h p f $ h p f _f i L L_i n_d a t a (frame #1 ) is in different 
iterations of the FOR ALL. With compiler assistance, it 
is possible to annotate each thread's location with iter- 
ation counts in addition to traditional line numbers. IS 



The resulting set of locations can be compared to a 
location in the conceptually serial program to deter- 
mine \\ hich threads have already reached (and perhaps 
passed) the serial location and which have not yet 
reached it. A debugger could automatically, or under 
user control, advance each thread to a consistent serial 
location. For now, Aardvark's differing values mecha- 
nism is the clue to the user that program state might 
not be consistent. 

Calling an HPF Procedure 

Having a debugger initiate a call to a Fortran 90 pro- 
cedure is difficult in the general case. One difficulty is 
that eopy-in/eopy-out (making a temporary copy of 
array arguments and copying the temporary back to its 
origin after the call returns) may be necessary. HPF 
adds two more difficulties. First, the data may need to 
be redistributed, which amounts to a distributed copy- 
in/copy-out and entails a lot of tedious (but hopefully 
straightforward) bookkeeping. Second, an HPF 
thread's state is much more complex than a collection 
of physical thread states. When a debugger initiates a 
uniprocessor procedure call, it generally saves the reg- 
isters, sets up the registers and stack according to the 
calling convention, lets the process run until the call 
returns, extracts the result, and finally restores the 
registers. The registers are generally the state that is 
preserved across a debugger- initiated procedure call. 
For HPF, and in general for other paradigms that: use 
message passing, it may be necessary to preserve the 
run-time state of the messaging subsystem in each 
process. This preservation probably amounts to mak- 
ing uniprocessor calls to messaging-supplied save/ 
restore entry points, allowing the messaging sub- 
system to define what its state is and how it should 
be saved and restored. Although logical entities would 
be used to coordinate the physical details, this is a lot 
of work and has not been prototyped. 

Related Work 

DIGITAL'S representative to the first meeting of 
the HPF User Group reported a general lament 
among users about the lack of debugger support. '''^ 
Browsing the World Wide Web reveals little on the 
topic of HPF debugging, although some efforts have 
provided various degrees of sophistication. 

Multiple Serial Debuggers 

A simplistic approach to debugging support is to start 
a traditional serial debugger on each component pro- 
cess, perhaps providing a separate window for each 
and providing some command broadcast capability. 
Although this approach provides basic debugging, it 
does not address any of the interesting challenges of 
HPF debugging. 
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Prism 

The Prism debugger (versions daring from 1992), for- 
merly from Thinking Machines Corporation, provides 
debugging support for CM Fortran. 2 The run- time 
model of CM Fortran is essentially single instruction, 
multiple data (SIMD), which considerably simplifies 
managing the program. The program gets compiled 
into an executable that broadcasts macroinstructions 
to the parallel machine, even on the CM-5 synchro- 
nized multiple instruction, multiple data (MIMD) 
machine. Prism primarily debugs the single program 
doing the broadcasting. Therefore, operations such as 
starting, stopping, and setting breakpoints can use the 
traditional uniprocessor debugging techniques. Prism 
is aware of distri buted data. When v isualizing a distrib- 
uted arrav, however, it presents each process's local 
portion and conceptual Iv augments the rank of the 
arrav to include a process axis. For example, a rwo- 
dimensional 400 x 400 array distributed ( *,c Y cli c ) 
on five processes is presented as a 400 x 80 x 5 arrav. 
For explicit message sending programs, Prism controls 
the target processes and provides a "where graph," 
which has some of the visual cues that Aardvark\s logi- 
cal frames provide. 

TotalView 

Recent (1997) versions of the TotalView debugger, 
from Dolphin Interconnect Solutions, Inc., provide 
some support for the HPF compiler from The Portland 
Group, Inc. 2 "- 14 TotalView provides "process groups, 11 
which are treated more like sets for set- wide operations 
than like a svnthesis into a single logical entity. As a 
result, no unified view of the call stacks exists. TotalView 
can "dive" into a distributed H PF array and present it as 
a single array in terms of the original source. Distributed 
data is not currently integrated into the expression 
system, however, so a conditional breakpoint such as 
A(3,4).LT.5 does not work. TotalView is being 
actively developed; tiiture versions vvili likely provide 
more complete support for HPF. 

Applicability to Other Areas 

Many of the techniques rhat Aardvark incorporates can 
apply to other areas, including the single program, 
multiple data (SPMD) paradigm, debugging optimized 
code, and interpreted languages. 

Single Program, Multiple Data 

Logical entities can be used to manage and examine 
programs that use the SPMD paradigm. This is true for 
process-level SPMD, which is commonly used with 
explicit message sending such as MP.I, V ' and for 
thread-level SPMD such as directed decomposi- 
tion. 2 ^ 2 " Aardvark's twinning algorithms can be used 
in both cases. Process-level SPMD is similar to 



D1GTTAI "s HPF; the equivalent of twinning requires a 
stvlisric way of coding and declaring a dispatch loop. 
Thread- level SPMD usually has a pool of threads wait- 
ing in a dispatch loop, requiring Aardvark to know 
some mechanics of the run-time support. 

The differing values mechanism can apply to data in 
SPMD paradigms. DIGITAL'S recent introduction of 
Thread Local Storage (TLS), 2 * modeled on the Thread 
Local Storage facility of Microsoft Visual C++ ' with 
similarities to TASKC0MM ON of Cray Fortran,'" provides 
another source of the same variable having potentially 
differing values in different thread contexts. 

Debugging Optimized Code 

Aardvark s flexible locative subsystem and its aware- 
ness of nonsingular values (i.e., differing values) can be 
the basis for "split-lifetime variables. 11 In optimized 
code, a variable can have several simultaneous lifetimes 
(e.g., the result of loop unrolling) or no active lifetime 
(e.g., between a usage and the next assignment). New 
derivations of the locative class can describe the multi- 
ple homes or the nonexistent home of a variable. 
Fetching by means of such a locative creates new kinds 
of values that hold all the values or an indication that 
there is no value. User interfaces become aware of 
these new kinds of values in ways similar to their 
awareness of differing values. 

AardvaiVs method of asking a thread for a single- 
stepping run reason and empowering the reason to 
accomplish its mission can be the basis for single step- 
ping optimized code. Optimized code generally inter- 
leaves instructions from different source lines, rendering 
the standard "execute instructions until the source line 
number changes 11 method of single stepping useless. 
If instead the compiler emits information about the 
semantic ev ents of a source line, Aardvark can construct 
a single-stepping run reason based on semantic events 
rather than line numbers. Single stepping an optimized 
HPF program immediately reaps the benefits since logi- 
cal stepping is built on physical stepping. 

Interpreted Languages 

Logical entities can be used to support debugging 
interpreted languages such as Java' 1 and Tel/ 2 In this 
case, the physical process is the operating system's 
process (the Java Virtual Machine or the Tel inter- 
preter), and the logical process is the user- level view 
of the program. A logical stack frame encodes a pro- 
cedure call of the source language. This is accom- 
plished by examining virtual stack information in 
physical memory and/or by examining physical 
stack frames, depending on how the interpreter is 
implemented. Variable lookup within the context of 
a logical frame would use the interpreter- managed 
symbol tables rather than the symbol tables of the 
physical process. 
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Summary 

HPF presents a variety of challenges to a debugger, 
including controlling the program, examining its call 
stack, and examining its data, and user interface impli- 
cations in each area. The concept of logical entities can 
be used to manage much of the control complexity, 
and a rich data- location model can manage HPF arrays 
and expressions involving arrays. Many of these ideas 
can apply to other debugging situations. On the sur- 
face, debugging HPF can appear to be a daunting task. 
Aardvark breaks down the task into pieces and attacks 
them using powerful extensions to familiar ideas. 
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